Neural Network Doc Summarization

dc.contributor.author: Cheng, Junjie
dc.date.accessioned: 2018-05-09T04:31:27Z
dc.date.available: 2018-05-09T04:31:27Z
dc.date.issued: 2018-05-07
dc.description.abstract: This is the Neural Network Document Summarization project for the Multimedia, Hypertext, and Information Access (CS 4624) course at Virginia Tech in the Spring 2018 semester. The purpose of the project is to generate a summary of a long document through deep learning, so that the resulting system can take over part of a human's work. The implementation consists of four phases: data preprocessing, model building, training, and testing. In the data preprocessing phase, the data set is split into training, validation, and testing sets in a 3:1:1 ratio; in each set, articles and abstracts are tokenized and then transformed into indexed documents. In the model building phase, a sequence-to-sequence model is implemented in PyTorch to transform articles into abstracts. The sequence-to-sequence model contains an encoder and a decoder, both implemented as recurrent neural networks with long short-term memory (LSTM) units, and an MLP attention model is applied to the decoder to improve its performance. In the training phase, the model iteratively loads data from the training set and learns from it: in each iteration, the model generates a summary of the input document and compares it with the reference summary, the difference being represented by a loss value. Based on the loss value, the model performs back propagation to improve its accuracy. In the testing phase, the validation and testing sets are used to measure the accuracy of the trained model: the model generates a summary of each input document, and the similarity between the generated summary and the human-written summary is evaluated with PyRouge. All of the above tasks were completed during the semester. With the trained model, users can generate CNN/Daily Mail style highlights from an input article.
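The pipeline the abstract describes (an LSTM encoder and decoder with MLP attention, trained by back propagation on a loss computed between the generated and reference summaries) can be sketched roughly as below. This is a minimal illustration only, not the project's code (which is in DocSummarization.zip); the class names, toy dimensions, and the single teacher-forced training step are assumptions made for the example.

```python
# Minimal PyTorch sketch of the seq2seq summarizer described above.
# All sizes and names are hypothetical; the project's real code is in DocSummarization.zip.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, EMB_DIM, HID_DIM = 5000, 128, 256  # assumed toy dimensions

class Encoder(nn.Module):
    """LSTM encoder: maps an indexed article to per-token hidden states."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.lstm = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len) token indices
        outputs, state = self.lstm(self.embed(src))
        return outputs, state                      # outputs: (batch, src_len, HID_DIM)

class MLPAttention(nn.Module):
    """Additive (MLP) attention over the encoder outputs."""
    def __init__(self):
        super().__init__()
        self.w = nn.Linear(2 * HID_DIM, HID_DIM)
        self.v = nn.Linear(HID_DIM, 1, bias=False)

    def forward(self, dec_h, enc_out):             # dec_h: (batch, HID_DIM)
        dec_h = dec_h.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        scores = self.v(torch.tanh(self.w(torch.cat([dec_h, enc_out], dim=-1))))
        weights = F.softmax(scores, dim=1)          # attention weights over source tokens
        return (weights * enc_out).sum(dim=1)       # context vector: (batch, HID_DIM)

class Decoder(nn.Module):
    """LSTM decoder that attends to the encoder outputs at every step."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.attn = MLPAttention()
        self.lstm = nn.LSTM(EMB_DIM + HID_DIM, HID_DIM, batch_first=True)
        self.out = nn.Linear(HID_DIM, VOCAB_SIZE)

    def forward(self, tok, state, enc_out):         # tok: (batch, 1) previous token
        context = self.attn(state[0][-1], enc_out)
        x = torch.cat([self.embed(tok), context.unsqueeze(1)], dim=-1)
        out, state = self.lstm(x, state)
        return self.out(out.squeeze(1)), state      # logits over the vocabulary

def train_step(encoder, decoder, optimizer, src, tgt):
    """One illustrative iteration: generate with teacher forcing, compute the loss,
    and back-propagate to improve the model."""
    optimizer.zero_grad()
    enc_out, state = encoder(src)
    loss = 0.0
    for t in range(tgt.size(1) - 1):
        logits, state = decoder(tgt[:, t:t + 1], state, enc_out)
        loss = loss + F.cross_entropy(logits, tgt[:, t + 1])
    loss.backward()                                  # back propagation on the loss value
    optimizer.step()
    return loss.item() / (tgt.size(1) - 1)
```

At test time the decoder would instead be fed its own previous prediction at each step, and the generated token sequence would be scored against the human-written highlights with PyRouge, as the abstract describes.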
dc.description.notes: DocSummarization.zip contains all source code for the project, the training data set, and a trained model. DocSummarizationReport (PDF and DOCX versions) describes the project design and all technical details, and includes a user manual and a developer manual. DocSummarizationPresentation (PDF and PPTX versions) contains the slides used for the final presentation of the project and shows its general design and phases.
dc.identifier.uri: http://hdl.handle.net/10919/83197
dc.language.iso: en_US
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Deep learning (Machine learning)
dc.subject: Natural Language Processing
dc.subject: Text Summarization
dc.subject: Recurrent Neural Network
dc.subject: Sequence to sequence
dc.title: Neural Network Doc Summarization
dc.type: Dataset
dc.type: Presentation
dc.type: Report
dc.type: Software

Files

Original bundle (5 files):
  DocSummarization.zip (160.86 MB)
  DocSummarizationPresentation.pdf (894.29 KB, Adobe Portable Document Format)
  DocSummarizationPresentation.pptx (735.72 KB, Microsoft Powerpoint XML)
  DocSummarizationReport.pdf (3.4 MB, Adobe Portable Document Format)
  DocSummarizationReport.docx (3.99 MB, Microsoft Word XML)

License bundle (1 file):
  license.txt (1.5 KB, item-specific license agreed upon to submission)