Zero-Shot Scene Graph Relationship Prediction using VLMs
dc.contributor.author | Dutta, Amartya | en |
dc.contributor.committeechair | Karpatne, Anuj | en |
dc.contributor.committeemember | Thomas, Christopher Lee | en |
dc.contributor.committeemember | Lourentzou, Ismini | en |
dc.contributor.department | Computer Science & Applications | en |
dc.date.accessioned | 2025-03-25T08:00:33Z | en |
dc.date.available | 2025-03-25T08:00:33Z | en |
dc.date.issued | 2025-03-24 | en |
dc.description.abstract | Scene Graph Relationship Prediction aims to predict the interactions between objects in an image. Despite the recent surge of interest in open-vocabulary and zero-shot scene graph generation (SGG), most approaches still require some form of training or adaptation on the target dataset, even when using Vision-Language Models (VLMs). In this work, we propose a training-free framework in which VLMs predict scene graph relationships. Our approach simply plugs VLMs into the pipeline without any fine-tuning, focusing on how to formulate relationship queries and aggregate predictions over object pairs. To this end, we introduce two model-agnostic frameworks: SGRP-MC, a multiple-choice question answering (MCQA) approach, and SGRP-Open, an open-ended formulation. Evaluations on the PSG dataset reveal that well-scaled VLMs not only achieve competitive recall scores but also surpass most trained baselines by over 7% in mean recall, showcasing their strength in long-tail predicate prediction. Nonetheless, we identify several practical challenges: the large number of potential relationship candidates and the susceptibility of VLMs to choice ordering can affect consistency. Through our comparison of SGRP-MC and SGRP-Open, we highlight trade-offs in structured prediction performance between multiple-choice constraints and open-ended flexibility. Our findings establish that zero-shot scene graph relationship prediction is feasible with a fully training-free VLM pipeline, laying the groundwork for leveraging large-scale foundation models for SGG without any additional fine-tuning. | en |
dc.description.abstractgeneral | Can Vision-Language Models (VLMs) predict relationships between objects in an image without any training? We introduce a fully training-free approach to Scene Graph Relationship Prediction by directly using VLMs. Our method explores two strategies: SGRP-MC, which frames the task as a multiple-choice question, and SGRP-Open, which allows for open-ended responses. Testing on the PSG dataset shows that large VLMs not only achieve competitive results but also outperform many trained models in mean recall, especially for rare relationships. However, challenges remain, such as handling numerous relationship options and maintaining consistency in predictions. By comparing structured (SGRP-MC) and open-ended (SGRP-Open) approaches, we highlight key trade-offs, demonstrating that zero-shot scene graph prediction is both possible and effective, opening new directions for VLMs in structured visual understanding. | en |
dc.description.degree | Master of Science | en |
dc.format.medium | ETD | en |
dc.identifier.other | vt_gsexam:42662 | en |
dc.identifier.uri | https://hdl.handle.net/10919/125077 | en |
dc.language.iso | en | en |
dc.publisher | Virginia Tech | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | Vision Language Models | en |
dc.subject | Scene Graphs | en |
dc.subject | Zero-Shot | en |
dc.title | Zero-Shot Scene Graph Relationship Prediction using VLMs | en |
dc.type | Thesis | en |
thesis.degree.discipline | Computer Science & Applications | en |
thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
thesis.degree.level | masters | en |
thesis.degree.name | Master of Science | en |
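Illustrative sketch (not part of the catalog record): the abstracts describe two query formulations, SGRP-MC (multiple-choice) and SGRP-Open (open-ended), but the record itself contains no implementation detail. The Python sketch below shows, under stated assumptions, how such relationship queries might be posed to a VLM for a single object pair. The predicate list, the prompt wording, and the query_vlm stub are hypothetical placeholders, not the thesis's actual pipeline; a full scene graph would aggregate such predictions over all object pairs in the image.

from typing import List

# Illustrative predicate vocabulary; the thesis evaluates on the PSG
# dataset, whose actual predicate set is larger.
PREDICATES: List[str] = ["on", "beside", "holding", "riding", "looking at"]

def build_mcqa_prompt(subject: str, obj: str, choices: List[str]) -> str:
    """MCQA-style query in the spirit of SGRP-MC: fixed answer choices.

    Note the abstract's caveat that VLMs are sensitive to choice ordering,
    so the order of `choices` can itself affect the prediction.
    """
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return (
        f"In this image, what is the relationship between the "
        f"{subject} and the {obj}?\n{options}\n"
        f"Answer with a single letter."
    )

def build_open_prompt(subject: str, obj: str) -> str:
    """Open-ended query in the spirit of SGRP-Open: no fixed choices."""
    return (
        f"In this image, describe the relationship between the "
        f"{subject} and the {obj} in one or two words."
    )

def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for any VLM call (hosted API or local model).

    Replace with a real model invocation; no fine-tuning is assumed,
    matching the training-free setting described in the abstract.
    """
    raise NotImplementedError("plug in a real VLM here")

def predict_relationship(image_path: str, subject: str, obj: str) -> str:
    """Predict one predicate for one (subject, object) pair via MCQA."""
    prompt = build_mcqa_prompt(subject, obj, PREDICATES)
    letter = query_vlm(image_path, prompt).strip()[0].upper()
    return PREDICATES[ord(letter) - ord("A")]

# Example usage for a single pair:
#   predict_relationship("kitchen.jpg", "person", "cup")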