Commonsense for Zero-Shot Natural Language Video Localization

dc.contributor.author: Holla, Meghana
dc.contributor.committeechair: Lourentzou, Ismini
dc.contributor.committeemember: Ramakrishnan, Narendran
dc.contributor.committeemember: Huang, Lifu
dc.contributor.department: Computer Science and Applications
dc.date.accessioned: 2023-07-08T08:00:35Z
dc.date.available: 2023-07-08T08:00:35Z
dc.date.issued: 2023-07-07
dc.description.abstract: Zero-shot Natural Language-Video Localization (NLVL) has shown promising results in training NLVL models solely with raw video data, through dynamic video segment proposal generation and pseudo-query annotations. However, existing pseudo-queries are weakly grounded in the source video and, owing to their unstructured nature, lack common ground with it. In this work, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense information, via a commonsense enhancement module, to bridge the gap between videos and generated pseudo-queries. Our approach employs Graph Convolutional Networks (GCNs) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query vectors prior to localization (see the illustrative sketch following this record). Through empirical evaluations on two benchmark datasets, we demonstrate that our model surpasses both zero-shot and weakly supervised baselines. These results underscore the significance of leveraging commonsense reasoning abilities in multimodal understanding tasks.
dc.description.abstractgeneral: Natural Language Video Localization (NLVL) is the task of retrieving relevant video segments from an untrimmed video given a user text query. To train an NLVL system, traditional methods demand annotations on the input videos: video segment spans (i.e., start and end timestamps) and the accompanying text queries describing those segments. These annotations are laborious to collect for any domain and video length. To alleviate this, zero-shot NLVL methods generate the aforementioned annotations dynamically. However, current zero-shot NLVL approaches suffer from poor alignment between the video and the dynamically generated query, which can introduce noise into the localization process. To this end, this work investigates how the implicit commonsense knowledge that humans innately possess can benefit zero-shot NLVL. We introduce CORONET, a zero-shot NLVL framework that utilizes commonsense information to bridge the gap between videos and generated pseudo-queries. Experiments on two benchmark datasets containing thematically diverse videos highlight the effectiveness of leveraging commonsense information.
dc.description.degree: Master of Science
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:37519
dc.identifier.uri: http://hdl.handle.net/10919/115684
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Video Localization
dc.subject: Zero-shot Natural Language Video Localization
dc.subject: Commonsense
dc.subject: Multimodal Machine Learning
dc.subject: Vision and Language
dc.title: Commonsense for Zero-Shot Natural Language Video Localization
dc.type: Thesis
thesis.degree.discipline: Computer Science and Applications
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: masters
thesis.degree.name: Master of Science
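
To make the enhancement step described in the abstract concrete, below is a minimal PyTorch sketch of a GCN-plus-cross-attention module of the kind CORONET describes. This is an illustrative assumption, not the thesis's actual implementation: the class name CommonsenseEnhancer, all dimensions, the layer count, and the residual and normalization choices are hypothetical.

import torch
import torch.nn as nn

class CommonsenseEnhancer(nn.Module):
    """Illustrative commonsense enhancement module (hypothetical, not the thesis code).

    A stack of graph-convolution layers encodes knowledge-graph concept
    embeddings, and multi-head cross-attention lets video (or pseudo-query)
    features attend over the encoded concepts prior to localization.
    """

    def __init__(self, dim=256, heads=4, gcn_layers=2):
        super().__init__()
        # One linear transform per GCN layer (Kipf-and-Welling-style propagation).
        self.gcn_weights = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(gcn_layers)]
        )
        # The features being enhanced attend over the commonsense node encodings.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def encode_graph(self, nodes, adj):
        # nodes: (B, N, dim) concept embeddings of the video-conditioned subgraph;
        # adj: (B, N, N) row-normalized adjacency matrix.
        h = nodes
        for layer in self.gcn_weights:
            h = torch.relu(layer(adj @ h))  # propagate over edges, then transform
        return h

    def forward(self, feats, nodes, adj):
        # feats: (B, T, dim) encoded video or pseudo-query vectors.
        concepts = self.encode_graph(nodes, adj)
        enhanced, _ = self.cross_attn(query=feats, key=concepts, value=concepts)
        return self.norm(feats + enhanced)  # residual enhancement

Under these assumptions, calling CommonsenseEnhancer()(feats, nodes, adj) with feats of shape (2, 100, 256), nodes of shape (2, 30, 256), and adj of shape (2, 30, 30) returns enhanced features with the same shape as feats, ready for a downstream localization head.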

Files

Original bundle
Name: Holla_M_T_2023.pdf
Size: 3.1 MB
Format: Adobe Portable Document Format
