Commonsense for Zero-Shot Natural Language Video Localization

Holla, Meghana

Files

Holla_M_T_2023.pdf (3.1 MB)

Date

2023-07-07

Authors

Holla, Meghana

Publisher

Virginia Tech

Abstract

Zero-shot Natural Language Video Localization (NLVL) has shown promising results by training NLVL models on raw video data alone, using dynamic video segment proposal generation and pseudo-query annotations. However, existing pseudo-queries are poorly grounded in the source video and, because they are unstructured, lack common ground. In this work, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that uses commonsense information to bridge the gap between videos and generated pseudo-queries through a commonsense enhancement module. Our approach employs Graph Convolutional Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query vectors prior to localization. In empirical evaluations on two benchmark datasets, our model surpasses both zero-shot and weakly supervised baselines. These results underscore the value of commonsense reasoning abilities in multimodal understanding tasks.
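
The two building blocks named in the abstract, a graph convolution over commonsense concepts and cross-attention from video features to those concepts, can be illustrated with a minimal sketch. This is not the CORONET implementation from the thesis: every function name, the toy graph, and the feature values are hypothetical, the attention has no learned projections, and the video conditioning of the knowledge graph is not modeled.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Scaled dot-product attention plus a residual connection (assumed design)."""
    d = len(queries[0])
    enhanced = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        context = [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        enhanced.append([qi + ci for qi, ci in zip(q, context)])
    return enhanced

# Toy commonsense graph: three concept nodes in a chain (hypothetical data).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
concept_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity stands in for learned weights
concepts = gcn_layer(adj, concept_feats, weight)

# Two video segment features attend over the encoded concepts.
video_feats = [[0.5, 0.5], [1.0, 0.0]]
enhanced_video = cross_attention(video_feats, concepts)
```

The same `cross_attention` call could be applied to pseudo-query vectors in place of `video_feats`; a real model would use learned query, key, and value projections and multiple heads.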

Keywords

Video Localization, Zero-shot Natural Language Video Localization, Commonsense, Multimodal Machine Learning, Vision and Language

Persistent link

http://hdl.handle.net/10919/115684

Collections

Masters Theses
