
Data Augmentation with Seq2Seq Models

dc.contributor.author: Granstedt, Jason Louis
dc.contributor.committeechair: Batra, Dhruv
dc.contributor.committeemember: Baumann, William T.
dc.contributor.committeemember: Huang, Bert
dc.contributor.department: Electrical and Computer Engineering
dc.date.accessioned: 2017-07-07T08:00:35Z
dc.date.available: 2017-07-07T08:00:35Z
dc.date.issued: 2017-07-06
dc.description.abstract: Paraphrase sparsity is an issue that complicates the training process of question answering systems: syntactically diverse but semantically equivalent sentences can have significant disparities in predicted output probabilities. We propose a method for generating an augmented paraphrase corpus for a visual question answering system to make it more robust to paraphrases. This corpus is generated by concatenating two sequence-to-sequence (Seq2Seq) models. In order to generate diverse paraphrases, we sample the neural network using diverse beam search (a hedged code sketch of this sampling step follows this record). We evaluate the results on the standard VQA validation set. Our approach results in a significantly expanded training dataset and vocabulary size, but performs slightly worse when tested on the validation split. Although not as fruitful as we had hoped, our work highlights additional avenues for investigation: selecting more optimal model parameters and developing a more sophisticated paraphrase filtering algorithm. The primary contributions of this work are the demonstration that decent paraphrases can be generated from Seq2Seq models and the development of a pipeline for building an augmented dataset.
dc.description.abstractgeneral: For a machine, processing language is hard. All possible combinations of words in a language far exceed a computer’s ability to directly memorize them. Thus, generalizing language into a form that a computer can reason with is necessary for a machine to understand raw human input. Various advancements in machine learning have been particularly impressive in this regard. However, they require a corpus, or a body of information, in order to learn. Collecting this corpus is typically expensive and time consuming, and it does not necessarily contain all of the information that a system would need to know; for example, the machine would not know how to handle a word that it has never seen before. This thesis examines the possibility of using a large, general corpus to expand the vocabulary size of a specialized corpus in order to improve performance on a specific task. To do so, we use Seq2Seq models, a recent development in neural networks that has seen great success in translation tasks. The Seq2Seq model is trained on the general corpus to learn the language and then applied to the specialized corpus to generate paraphrases matching the format of the specialized corpus. Via this approach we significantly expanded the volume and vocabulary size of the specialized corpus (a toy illustration of measuring such growth follows this record), demonstrated that decent paraphrases can be generated from Seq2Seq models, and developed a pipeline for augmenting other specialized datasets.
dc.description.degree: Master of Science
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:10139
dc.identifier.uri: http://hdl.handle.net/10919/78315
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Data Augmentation
dc.subject: Seq2Seq
dc.subject: Diverse Beam Search
dc.subject: VQA
dc.title: Data Augmentation with Seq2Seq Models
dc.type: Thesis
thesis.degree.discipline: Electrical Engineering
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: masters
thesis.degree.name: Master of Science
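
As a concrete illustration of the sampling step described in the abstract, below is a minimal sketch of diverse beam search for paraphrase generation. It assumes the Hugging Face transformers library, and the t5-small checkpoint is purely a placeholder; the thesis concatenated two custom-trained Seq2Seq models, so none of these names come from the work itself.

    # Illustrative sketch only: the thesis used its own Seq2Seq models,
    # not a pretrained T5 checkpoint from Hugging Face.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    checkpoint = "t5-small"  # placeholder; a paraphrase-tuned model would go here
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    question = "What color is the dog in the picture?"
    inputs = tokenizer("paraphrase: " + question, return_tensors="pt")

    # Diverse beam search: beams are partitioned into groups, and a
    # diversity penalty discourages different groups from emitting the
    # same tokens, yielding syntactically varied paraphrase candidates.
    outputs = model.generate(
        **inputs,
        num_beams=6,
        num_beam_groups=3,
        diversity_penalty=0.5,
        num_return_sequences=6,
        max_new_tokens=32,
    )
    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))

Each returned sequence would then be filtered and added to the training questions, which is the augmentation step the abstract evaluates on the VQA validation set.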
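The general-audience abstract also reports expanding both the volume and the vocabulary of the specialized corpus. The toy example below shows one way such growth could be measured; the sentences are invented stand-ins, not data from the thesis.

    # Toy measurement of corpus and vocabulary growth after augmentation.
    # The sentences are hypothetical examples, not thesis data.
    original = ["what color is the dog", "how many people are there"]
    paraphrases = ["what is the color of the dog", "how many persons are present"]
    augmented = original + paraphrases

    def vocab(corpus):
        # Whitespace tokenization is a simplification; VQA pipelines
        # typically apply their own tokenizer.
        return {token for sentence in corpus for token in sentence.split()}

    print(f"corpus size: {len(original)} -> {len(augmented)}")
    print(f"vocabulary size: {len(vocab(original))} -> {len(vocab(augmented))}")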

Files

Original bundle
Name: Granstedt_JL_T_2017.pdf
Size: 1.78 MB
Format: Adobe Portable Document Format
