Leveraging Multimodal Perspectives to Learn Common Sense for Vision and Language Tasks

dc.contributor.author: Lin, Xiao
dc.contributor.committeechair: Parikh, Devi
dc.contributor.committeemember: Abbott, A. Lynn
dc.contributor.committeemember: Dhillon, Harpreet Singh
dc.contributor.committeemember: Tokekar, Pratap
dc.contributor.committeemember: Batra, Dhruv
dc.contributor.committeemember: Huang, Bert
dc.contributor.department: Electrical and Computer Engineering
dc.date.accessioned: 2017-10-06T08:00:18Z
dc.date.available: 2017-10-06T08:00:18Z
dc.date.issued: 2017-10-05
dc.description.abstract: Learning and reasoning with common sense is a challenging problem in Artificial Intelligence (AI). Humans have the remarkable ability to interpret images and text from different perspectives in multiple modalities, and to draw on large amounts of commonsense knowledge while performing visual or textual tasks. Inspired by that ability, we approach commonsense learning as leveraging perspectives from multiple modalities for images and text in the context of vision and language tasks. Given a target task (e.g., textual reasoning, matching images with captions), our system first represents the input images and text in multiple modalities (e.g., vision, text, abstract scenes, and facts). Those modalities provide different perspectives for interpreting the input images and text. Based on those perspectives, the system then performs reasoning to make a joint prediction for the target task. Surprisingly, we show that interpreting textual assertions and scene descriptions in the modality of abstract scenes improves performance on various textual reasoning tasks, and that interpreting images in the modality of Visual Question Answering improves performance on caption retrieval, which is a visual reasoning task. With grounding, imagination, and question-answering approaches to interpret images and text in different modalities, we show that learning commonsense knowledge from multiple modalities effectively improves the performance of downstream vision and language tasks, improves the interpretability of the model, and makes more efficient use of training data. Complementary to the model aspect, we also study the data aspect of commonsense learning in vision and language. We study active learning for Visual Question Answering (VQA), where a model iteratively grows its knowledge by querying informative questions about images for answers. Drawing analogies from human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a new goal-driven scoring function for deep VQA models under the Bayesian Neural Network framework. Once trained with a large initial training set, a deep VQA model can efficiently query informative question-image pairs for answers to improve itself through active learning, saving human effort on commonsense annotations.
dc.description.degree: Ph. D.
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:12992
dc.identifier.uri: http://hdl.handle.net/10919/79521
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Common Sense
dc.subject: Multimodal
dc.subject: Visual Question Answering
dc.subject: Image-Caption Ranking
dc.subject: Vision and Language
dc.subject: Active Learning
dc.title: Leveraging Multimodal Perspectives to Learn Common Sense for Vision and Language Tasks
dc.type: Dissertation
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: doctoral
thesis.degree.name: Ph. D.
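
Note: the abstract mentions entropy-based ("cramming") scoring as one of the active learning criteria studied. The short Python sketch below is only an illustration of that general idea, not the dissertation's implementation; the names entropy_scores, probs, and k are hypothetical, and the answer distributions are toy values.

    import numpy as np

    def entropy_scores(answer_probs):
        # Entropy of each candidate's predicted answer distribution.
        # Rows of answer_probs are assumed to sum to 1 over the answer vocabulary;
        # higher entropy means the model is less certain about that pair.
        eps = 1e-12
        return -np.sum(answer_probs * np.log(answer_probs + eps), axis=1)

    # Toy predicted answer distributions for three unlabeled question-image pairs.
    probs = np.array([[0.90, 0.05, 0.05],
                      [0.40, 0.30, 0.30],
                      [0.60, 0.20, 0.20]])

    # Query the k pairs the model is most uncertain about.
    k = 2
    query_indices = np.argsort(-entropy_scores(probs))[:k]
    print(query_indices)  # indices of the most informative pairs to send for annotation

In an active learning loop of this kind, the selected pairs would be labeled and added to the training set, and the scores recomputed after retraining.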

Files

Original bundle
Name: Lin_X_D_2017.pdf
Size: 16.82 MB
Format: Adobe Portable Document Format
Name: Lin_X_D_2017_support_1.pdf
Size: 164.34 KB
Format: Adobe Portable Document Format
Description: Supporting documents