Commonsense for Zero-Shot Natural Language Video Localization

dc.contributor.author: Holla, Meghana
dc.contributor.committeechair: Lourentzou, Ismini
dc.contributor.committeemember: Ramakrishnan, Narendran
dc.contributor.committeemember: Huang, Lifu
dc.contributor.department: Computer Science and Applications
dc.date.accessioned: 2023-07-08T08:00:35Z
dc.date.available: 2023-07-08T08:00:35Z
dc.date.issued: 2023-07-07
dc.description.abstract: Zero-shot Natural Language-Video Localization (NLVL) has shown promising results in training NLVL models solely with raw video data, through dynamic video segment proposal generation and pseudo-query annotations. However, existing pseudo-queries are weakly grounded in the source video and, owing to their unstructured nature, lack common ground with it. In this work, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense information, via a commonsense enhancement module, to bridge the gap between videos and generated pseudo-queries. Our approach employs Graph Convolutional Networks (GCNs) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query vectors prior to localization (see the illustrative sketch following this record). Through empirical evaluations on two benchmark datasets, we demonstrate that our model surpasses both zero-shot and weakly supervised baselines. These results underscore the significance of leveraging commonsense reasoning abilities in multimodal understanding tasks.
dc.description.abstractgeneral: Natural Language Video Localization (NLVL) is the task of retrieving relevant video segments from an untrimmed video given a user text query. To train an NLVL system, traditional methods demand annotations on the input videos: video segment spans (i.e., start and end timestamps) and the accompanying text queries describing those segments. These annotations are laborious to collect for any domain and video length. To alleviate this, zero-shot NLVL methods generate the aforementioned annotations dynamically. However, current zero-shot NLVL approaches suffer from poor alignment between the video and the dynamically generated query, which can introduce noise into the localization process. To this end, this work investigates how the implicit commonsense knowledge that humans innately possess can benefit zero-shot NLVL. We introduce CORONET, a zero-shot NLVL framework that utilizes commonsense information to bridge the gap between videos and generated pseudo-queries. Experiments on two benchmark datasets containing thematically diverse videos highlight the effectiveness of leveraging commonsense information.
dc.description.degree: Master of Science
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:37519
dc.identifier.uri: http://hdl.handle.net/10919/115684
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Video Localization
dc.subject: Zero-shot Natural Language Video Localization
dc.subject: Commonsense
dc.subject: Multimodal Machine Learning
dc.subject: Vision and Language
dc.title: Commonsense for Zero-Shot Natural Language Video Localization
dc.type: Thesis
thesis.degree.discipline: Computer Science and Applications
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: masters
thesis.degree.name: Master of Science
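
To make the enhancement step described in the abstract concrete, below is a minimal PyTorch sketch of a GCN-plus-cross-attention module of the kind CORONET describes. This is an illustrative assumption, not the thesis's actual implementation: the class name CommonsenseEnhancer, all dimensions, the layer count, and the residual and normalization choices are hypothetical.

import torch
import torch.nn as nn

class CommonsenseEnhancer(nn.Module):
    """Illustrative commonsense enhancement module (hypothetical, not the thesis code).

    A stack of graph-convolution layers encodes knowledge-graph concept
    embeddings, and multi-head cross-attention lets video (or pseudo-query)
    features attend over the encoded concepts prior to localization.
    """

    def __init__(self, dim=256, heads=4, gcn_layers=2):
        super().__init__()
        # One linear transform per GCN layer (Kipf-and-Welling-style propagation).
        self.gcn_weights = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(gcn_layers)]
        )
        # The features being enhanced attend over the commonsense node encodings.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def encode_graph(self, nodes, adj):
        # nodes: (B, N, dim) concept embeddings of the video-conditioned subgraph;
        # adj: (B, N, N) row-normalized adjacency matrix.
        h = nodes
        for layer in self.gcn_weights:
            h = torch.relu(layer(adj @ h))  # propagate over edges, then transform
        return h

    def forward(self, feats, nodes, adj):
        # feats: (B, T, dim) encoded video or pseudo-query vectors.
        concepts = self.encode_graph(nodes, adj)
        enhanced, _ = self.cross_attn(query=feats, key=concepts, value=concepts)
        return self.norm(feats + enhanced)  # residual enhancement

Under these assumptions, calling CommonsenseEnhancer()(feats, nodes, adj) with feats of shape (2, 100, 256), nodes of shape (2, 30, 256), and adj of shape (2, 30, 30) returns enhanced features with the same shape as feats, ready for a downstream localization head.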

Files

Original bundle
Name: Holla_M_T_2023.pdf
Size: 3.1 MB
Format: Adobe Portable Document Format
