Zero-Shot Scene Graph Relationship Prediction using VLMs

dc.contributor.author: Dutta, Amartya
dc.contributor.committeechair: Karpatne, Anuj
dc.contributor.committeemember: Thomas, Christopher Lee
dc.contributor.committeemember: Lourentzou, Ismini
dc.contributor.department: Computer Science & Applications
dc.date.accessioned: 2025-03-25T08:00:33Z
dc.date.available: 2025-03-25T08:00:33Z
dc.date.issued: 2025-03-24
dc.description.abstract: Scene graph relationship prediction aims to predict the interactions between objects in an image. Despite the recent surge of interest in open-vocabulary and zero-shot SGG, most approaches still require some form of training or adaptation on the target dataset, even when using Vision-Language Models (VLMs). In this work, we propose a training-free framework in which VLMs predict scene graph relationships. Our approach simply plugs VLMs into the pipeline without any fine-tuning, focusing on how to formulate relationship queries and aggregate predictions over object pairs. To this end, we introduce two model-agnostic frameworks: SGRP-MC, a multiple-choice question answering (MCQA) approach, and SGRP-Open, an open-ended formulation. Evaluations on the PSG dataset reveal that well-scaled VLMs not only achieve competitive recall scores but also surpass most trained baselines by over 7% in mean recall, showcasing their strength in long-tail predicate prediction. Nonetheless, we identify several practical challenges: the large number of potential relationship candidates and the susceptibility of VLMs to choice ordering can affect consistency. Through our comparison of SGRP-MC and SGRP-Open, we highlight trade-offs in structured prediction performance between multiple-choice constraints and open-ended flexibility. Our findings establish that zero-shot scene graph relationship prediction is feasible with a fully training-free VLM pipeline, laying the groundwork for leveraging large-scale foundation models for SGG without any additional fine-tuning.
dc.description.abstractgeneral: Can Vision-Language Models (VLMs) predict relationships between objects in an image without any training? We introduce a fully training-free approach to Scene Graph Relationship Prediction by directly using VLMs. Our method explores two strategies: SGRP-MC, which frames the task as a multiple-choice question, and SGRP-Open, which allows open-ended responses. Testing on the PSG dataset shows that large VLMs not only achieve competitive results but also outperform many trained models in mean recall, especially for rare relationships. However, challenges remain, such as handling numerous relationship options and maintaining consistency in predictions. By comparing structured (SGRP-MC) and open-ended (SGRP-Open) approaches, we highlight key trade-offs, demonstrating that zero-shot scene graph prediction is both possible and effective, opening new directions for VLMs in structured visual understanding.
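To make the two query formulations concrete, the Python sketch below shows how a single object pair might be queried under each strategy. This is a minimal illustration under assumed interfaces: the ask_vlm callable, the function names, and the exact prompt wording are hypothetical and are not taken from the thesis.

# Illustrative sketch only: prompt construction for a frozen (untrained) VLM,
# in the spirit of SGRP-MC (multiple choice) and SGRP-Open (open ended).
# The ask_vlm interface and all prompt wording are assumptions.
from typing import Callable, List, Optional

def build_mc_prompt(subject: str, obj: str, predicates: List[str]) -> str:
    """SGRP-MC style: frame the pair's relationship as a multiple-choice
    question over the predicate vocabulary."""
    choices = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(predicates))
    return (
        f"What is the relationship between the {subject} and the {obj} "
        f"in this image?\n{choices}\nAnswer with the number of the best choice."
    )

def build_open_prompt(subject: str, obj: str) -> str:
    """SGRP-Open style: ask for the predicate as free-form text instead of
    constraining the answer to enumerated choices."""
    return (
        f"In one or two words, what is the relationship between the "
        f"{subject} and the {obj} in this image?"
    )

def predict_pair(ask_vlm: Callable[[str], str], subject: str, obj: str,
                 predicates: List[str]) -> Optional[str]:
    """Query the VLM once for one object pair (no fine-tuning involved)
    and map the returned choice number back to a predicate."""
    answer = ask_vlm(build_mc_prompt(subject, obj, predicates))
    try:
        index = int(answer.strip().split()[0].rstrip(".)")) - 1
    except (ValueError, IndexError):
        return None  # answer did not follow the requested format
    return predicates[index] if 0 <= index < len(predicates) else None

Because the abstract notes that VLMs are sensitive to choice ordering, a practical variant of such a sketch might query each pair several times with shuffled choice lists and aggregate the answers by majority vote.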
dc.description.degree: Master of Science
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:42662
dc.identifier.uri: https://hdl.handle.net/10919/125077
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Vision Language Models
dc.subject: Scene Graphs
dc.subject: Zero Shot
dc.title: Zero-Shot Scene Graph Relationship Prediction using VLMs
dc.type: Thesis
thesis.degree.discipline: Computer Science & Applications
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: masters
thesis.degree.name: Master of Science

Files

Original bundle
Name: Dutta_A_T_2025.pdf
Size: 3.1 MB
Format: Adobe Portable Document Format