Zero-Shot Scene Graph Relationship Prediction using VLMs
Abstract
Scene Graph Relationship Prediction (SGRP) aims to predict the interactions between objects in an image. Despite the recent surge of interest in open-vocabulary and zero-shot Scene Graph Generation (SGG), most approaches still require some form of training or adaptation on the target dataset, even when they build on Vision-Language Models (VLMs). In this work, we propose a training-free framework in which VLMs predict scene graph relationships directly. Our approach plugs VLMs into the pipeline without any fine-tuning, focusing instead on how to formulate relationship queries and how to aggregate predictions across object pairs. To this end, we introduce two model-agnostic frameworks: SGRP-MC, a multiple-choice question answering (MCQA) approach, and SGRP-Open, an open-ended formulation. Evaluations on the PSG dataset reveal that well-scaled VLMs not only achieve competitive recall scores but also surpass most trained baselines by over 7% in mean recall, showcasing their strength in long-tail predicate prediction. Nonetheless, we identify several practical challenges: the large number of candidate relationships and the susceptibility of VLMs to choice ordering can both harm consistency. Through our comparison of SGRP-MC and SGRP-Open, we highlight trade-offs in structured prediction performance between multiple-choice constraints and open-ended flexibility. Our findings establish that zero-shot scene graph relationship prediction is feasible with a fully training-free VLM pipeline, laying the groundwork for leveraging large-scale foundation models for SGG without any additional fine-tuning.
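To make the contrast between the two query formulations concrete, the sketch below shows how the prompts for a single object pair might be assembled. This is a minimal illustration under our own assumptions: the predicate list, prompt wording, and function names are hypothetical and do not reflect the paper's released implementation.

```python
# Illustrative sketch of the two query styles described in the abstract.
# The predicate subset and exact phrasing are assumptions for demonstration.

PREDICATES = ["on", "holding", "beside", "riding", "looking at"]  # example subset


def mcqa_prompt(subject: str, obj: str, predicates=PREDICATES) -> str:
    """SGRP-MC style: pose the relationship query as multiple choice."""
    choices = "\n".join(
        f"({chr(ord('A') + i)}) {p}" for i, p in enumerate(predicates)
    )
    return (
        f"In the image, what is the relationship between the {subject} "
        f"and the {obj}? Choose one option.\n{choices}\nAnswer:"
    )


def open_prompt(subject: str, obj: str) -> str:
    """SGRP-Open style: ask for the predicate as free-form text."""
    return (
        f"In the image, describe the relationship between the {subject} "
        f"and the {obj} using a single predicate. Answer:"
    )


if __name__ == "__main__":
    print(mcqa_prompt("person", "horse"))
    print(open_prompt("person", "horse"))
```

The same pair-level query would be issued for every candidate object pair, with the per-pair answers aggregated into a ranked triplet list; note that the MCQA form constrains the output space to a fixed predicate vocabulary, while the open-ended form trades that structure for flexibility, mirroring the trade-off discussed above.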