Zero-Shot Scene Graph Relationship Prediction using VLMs

dc.contributor.author: Dutta, Amartya
dc.contributor.committeechair: Karpatne, Anuj
dc.contributor.committeemember: Thomas, Christopher Lee
dc.contributor.committeemember: Lourentzou, Ismini
dc.contributor.department: Computer Science & Applications
dc.date.accessioned: 2025-03-25T08:00:33Z
dc.date.available: 2025-03-25T08:00:33Z
dc.date.issued: 2025-03-24
dc.description.abstract: Scene graph relationship prediction aims to predict the interactions between objects in an image. Despite the recent surge of interest in open-vocabulary and zero-shot SGG, most approaches still require some form of training or adaptation on the target dataset, even when using Vision-Language Models (VLMs). In this work, we propose a training-free framework in which VLMs predict scene graph relationships. Our approach simply plugs VLMs into the pipeline without any fine-tuning, focusing on how to formulate relationship queries and aggregate predictions over object pairs. To this end, we introduce two model-agnostic frameworks: SGRP-MC, a multiple-choice question answering (MCQA) approach, and SGRP-Open, an open-ended formulation. Evaluations on the PSG dataset reveal that well-scaled VLMs not only achieve competitive recall scores but also surpass most trained baselines by over 7% in mean recall, showcasing their strength in long-tail predicate prediction. Nonetheless, we identify several practical challenges: the large number of potential relationship candidates and the susceptibility of VLMs to choice ordering can affect consistency. Through our comparison of SGRP-MC and SGRP-Open, we highlight trade-offs in structured prediction performance between multiple-choice constraints and open-ended flexibility. Our findings establish that zero-shot scene graph relationship prediction is feasible with a fully training-free VLM pipeline, laying the groundwork for leveraging large-scale foundation models for SGG without any additional fine-tuning.
dc.description.abstractgeneral: Can Vision-Language Models (VLMs) predict relationships between objects in an image without any training? We introduce a fully training-free approach to Scene Graph Relationship Prediction by directly using VLMs. Our method explores two strategies: SGRP-MC, which frames the task as a multiple-choice question, and SGRP-Open, which allows open-ended responses. Testing on the PSG dataset shows that large VLMs not only achieve competitive results but also outperform many trained models in mean recall, especially for rare relationships. However, challenges remain, such as handling numerous relationship options and maintaining consistency in predictions. By comparing structured (SGRP-MC) and open-ended (SGRP-Open) approaches, we highlight key trade-offs, demonstrating that zero-shot scene graph prediction is both possible and effective, opening new directions for VLMs in structured visual understanding.
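To make the two query formulations concrete, the Python sketch below shows how a single object pair might be queried under each strategy. This is a minimal illustration under assumed interfaces: the ask_vlm callable, the function names, and the exact prompt wording are hypothetical and are not taken from the thesis.

# Illustrative sketch only: prompt construction for a frozen (untrained) VLM,
# in the spirit of SGRP-MC (multiple choice) and SGRP-Open (open ended).
# The ask_vlm interface and all prompt wording are assumptions.
from typing import Callable, List, Optional

def build_mc_prompt(subject: str, obj: str, predicates: List[str]) -> str:
    """SGRP-MC style: frame the pair's relationship as a multiple-choice
    question over the predicate vocabulary."""
    choices = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(predicates))
    return (
        f"What is the relationship between the {subject} and the {obj} "
        f"in this image?\n{choices}\nAnswer with the number of the best choice."
    )

def build_open_prompt(subject: str, obj: str) -> str:
    """SGRP-Open style: ask for the predicate as free-form text instead of
    constraining the answer to enumerated choices."""
    return (
        f"In one or two words, what is the relationship between the "
        f"{subject} and the {obj} in this image?"
    )

def predict_pair(ask_vlm: Callable[[str], str], subject: str, obj: str,
                 predicates: List[str]) -> Optional[str]:
    """Query the VLM once for one object pair (no fine-tuning involved)
    and map the returned choice number back to a predicate."""
    answer = ask_vlm(build_mc_prompt(subject, obj, predicates))
    try:
        index = int(answer.strip().split()[0].rstrip(".)")) - 1
    except (ValueError, IndexError):
        return None  # answer did not follow the requested format
    return predicates[index] if 0 <= index < len(predicates) else None

Because the abstract notes that VLMs are sensitive to choice ordering, a practical variant of such a sketch might query each pair several times with shuffled choice lists and aggregate the answers by majority vote.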
dc.description.degree: Master of Science
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:42662
dc.identifier.uri: https://hdl.handle.net/10919/125077
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Vision Language Models
dc.subject: Scene Graphs
dc.subject: Zero Shot
dc.title: Zero-Shot Scene Graph Relationship Prediction using VLMs
dc.type: Thesis
thesis.degree.discipline: Computer Science & Applications
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: masters
thesis.degree.name: Master of Science

Files

Original bundle
Name: Dutta_A_T_2025.pdf
Size: 3.1 MB
Format: Adobe Portable Document Format