Author: Dutta, Amartya
Dates: 2025-03-25; 2025-03-25; 2025-03-24
Identifier: vt_gsexam:42662
URI: https://hdl.handle.net/10919/125077

Abstract: Scene graph relationship prediction aims to predict the interactions between objects in an image. Despite the recent surge of interest in open-vocabulary and zero-shot scene graph generation (SGG), most approaches still require some form of training or adaptation on the target dataset, even when using Vision-Language Models (VLMs). In this work, we propose a training-free framework in which VLMs predict scene graph relationships. Our approach plugs VLMs into the pipeline without any fine-tuning, focusing on how to formulate relationship queries and how to aggregate predictions across object pairs. To this end, we introduce two model-agnostic frameworks: SGRP-MC, a multiple-choice question answering (MCQA) approach, and SGRP-Open, an open-ended formulation. Evaluations on the PSG dataset reveal that well-scaled VLMs not only achieve competitive recall scores but also surpass most trained baselines by over 7% in mean recall, showcasing their strength in long-tail predicate prediction. Nonetheless, we identify several practical challenges: the large number of potential relationship candidates and the susceptibility of VLMs to choice ordering can affect consistency. Through our comparison of SGRP-MC and SGRP-Open, we highlight trade-offs in structured prediction performance between multiple-choice constraints and open-ended flexibility. Our findings establish that zero-shot scene graph relationship prediction is feasible with a fully training-free VLM pipeline, laying the groundwork for leveraging large-scale foundation models for SGG without additional fine-tuning.

Type: Thesis (ETD)
Language: en
Rights: In Copyright
Subjects: Vision Language Models; Scene Graphs; Zero Shot
Title: Zero-Shot Scene Graph Relationship Prediction using VLMs
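
To make the query-formulation step concrete, the following is a minimal Python sketch of how an MCQA-style relationship query, in the spirit of SGRP-MC, could be assembled for each ordered object pair. The function names (build_mcqa_prompt, relationship_queries), the prompt wording, and the short predicate list are illustrative assumptions; they do not reproduce the thesis's actual templates or the full PSG predicate vocabulary, and the VLM call itself is omitted since the sketch covers only prompt construction and pair enumeration.

    from itertools import permutations

    # Illustrative predicate list only; the actual PSG predicate
    # vocabulary is much larger than this.
    PREDICATES = ["on", "beside", "holding", "riding", "looking at"]

    def build_mcqa_prompt(subject: str, obj: str, predicates: list[str]) -> str:
        """Format one multiple-choice relationship query for an object pair.

        Mirrors the general idea of an MCQA formulation: the VLM is shown
        lettered predicate choices and asked to pick exactly one. The
        wording here is an assumption, not the thesis's template.
        """
        choices = "\n".join(
            f"({chr(ord('A') + i)}) {p}" for i, p in enumerate(predicates)
        )
        return (
            f"In the image, what is the relationship between "
            f"the {subject} and the {obj}?\n{choices}\n"
            "Answer with the letter of the best choice."
        )

    def relationship_queries(objects: list[str]) -> dict[tuple[str, str], str]:
        """Build one prompt per ordered (subject, object) pair."""
        return {
            (s, o): build_mcqa_prompt(s, o, PREDICATES)
            for s, o in permutations(objects, 2)
        }

    if __name__ == "__main__":
        for pair, prompt in relationship_queries(["person", "horse"]).items():
            print(pair)
            print(prompt)
            print()

Since the abstract reports that VLMs are sensitive to choice ordering, one plausible diagnostic under this formulation is to issue the same query several times with the predicate choices shuffled and check whether the selected answer stays stable; this is a suggested probe, not a method taken from the thesis.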