Leveraging Multimodal Perspectives to Learn Common Sense for Vision and Language Tasks

dc.contributor.author: Lin, Xiao
dc.contributor.committeechair: Parikh, Devi
dc.contributor.committeemember: Abbott, A. Lynn
dc.contributor.committeemember: Dhillon, Harpreet Singh
dc.contributor.committeemember: Tokekar, Pratap
dc.contributor.committeemember: Batra, Dhruv
dc.contributor.committeemember: Huang, Bert
dc.contributor.department: Electrical and Computer Engineering
dc.date.accessioned: 2017-10-06T08:00:18Z
dc.date.available: 2017-10-06T08:00:18Z
dc.date.issued: 2017-10-05
dc.description.abstract: Learning and reasoning with common sense is a challenging problem in Artificial Intelligence (AI). Humans have the remarkable ability to interpret images and text from different perspectives in multiple modalities, and to draw on large amounts of commonsense knowledge while performing visual or textual tasks. Inspired by that ability, we approach commonsense learning as leveraging perspectives from multiple modalities for images and text in the context of vision and language tasks. Given a target task (e.g., textual reasoning, matching images with captions), our system first represents the input images and text in multiple modalities (e.g., vision, text, abstract scenes, and facts). Those modalities provide different perspectives for interpreting the input images and text. Based on those perspectives, the system then performs reasoning to make a joint prediction for the target task. Surprisingly, we show that interpreting textual assertions and scene descriptions in the modality of abstract scenes improves performance on various textual reasoning tasks, and that interpreting images in the modality of Visual Question Answering improves performance on caption retrieval, which is a visual reasoning task. With grounding, imagination, and question-answering approaches to interpret images and text in different modalities, we show that learning commonsense knowledge from multiple modalities effectively improves the performance of downstream vision and language tasks, improves the interpretability of the model, and makes more efficient use of training data. Complementary to the model aspect, we also study the data aspect of commonsense learning in vision and language. We study active learning for Visual Question Answering (VQA), where a model iteratively grows its knowledge by querying informative questions about images for answers. Drawing analogies from human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a new goal-driven scoring function for deep VQA models under the Bayesian Neural Network framework. Once trained with a large initial training set, a deep VQA model can efficiently query informative question-image pairs for answers to improve itself through active learning, saving human effort on commonsense annotations.
dc.description.degree: Ph. D.
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:12992
dc.identifier.uri: http://hdl.handle.net/10919/79521
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Common Sense
dc.subject: Multimodal
dc.subject: Visual Question Answering
dc.subject: Image-Caption Ranking
dc.subject: Vision and Language
dc.subject: Active Learning
dc.title: Leveraging Multimodal Perspectives to Learn Common Sense for Vision and Language Tasks
dc.type: Dissertation
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: doctoral
thesis.degree.name: Ph. D.
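
Note: the abstract mentions entropy-based ("cramming") scoring as one of the active learning criteria studied. The short Python sketch below is only an illustration of that general idea, not the dissertation's implementation; the names entropy_scores, probs, and k are hypothetical, and the answer distributions are toy values.

    import numpy as np

    def entropy_scores(answer_probs):
        # Entropy of each candidate's predicted answer distribution.
        # Rows of answer_probs are assumed to sum to 1 over the answer vocabulary;
        # higher entropy means the model is less certain about that pair.
        eps = 1e-12
        return -np.sum(answer_probs * np.log(answer_probs + eps), axis=1)

    # Toy predicted answer distributions for three unlabeled question-image pairs.
    probs = np.array([[0.90, 0.05, 0.05],
                      [0.40, 0.30, 0.30],
                      [0.60, 0.20, 0.20]])

    # Query the k pairs the model is most uncertain about.
    k = 2
    query_indices = np.argsort(-entropy_scores(probs))[:k]
    print(query_indices)  # indices of the most informative pairs to send for annotation

In an active learning loop of this kind, the selected pairs would be labeled and added to the training set, and the scores recomputed after retraining.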

Files

Original bundle
Name: Lin_X_D_2017.pdf
Size: 16.82 MB
Format: Adobe Portable Document Format
Name: Lin_X_D_2017_support_1.pdf
Size: 164.34 KB
Format: Adobe Portable Document Format
Description: Supporting documents