Neural Network Doc Summarization

dc.contributor.author: Cheng, Junjie
dc.date.accessioned: 2018-05-09T04:31:27Z
dc.date.available: 2018-05-09T04:31:27Z
dc.date.issued: 2018-05-07
dc.description.abstract: This is the Neural Network Document Summarization project for the Multimedia, Hypertext, and Information Access (CS 4624) course at Virginia Tech in the Spring 2018 semester. The purpose of the project is to generate a summary of a long document through deep learning, so that the resulting system can take over part of a human's work. The implementation consists of four phases: data preprocessing, model building, training, and testing. In the data preprocessing phase, the data set is split into training, validation, and testing sets in a 3:1:1 ratio; in each set, articles and abstracts are tokenized and then transformed into indexed documents. In the model building phase, a sequence-to-sequence model is implemented in PyTorch to transform articles into abstracts. The sequence-to-sequence model contains an encoder and a decoder, both implemented as recurrent neural networks with long short-term memory (LSTM) units, and an MLP attention model is applied to the decoder to improve its performance. In the training phase, the model iteratively loads data from the training set and learns from it: in each iteration, the model generates a summary of the input document and compares it with the reference summary, the difference being represented by a loss value. Based on the loss value, the model performs back propagation to improve its accuracy. In the testing phase, the validation and testing sets are used to measure the accuracy of the trained model: the model generates a summary of each input document, and the similarity between the generated summary and the human-written summary is evaluated with PyRouge. All of the above tasks were completed during the semester. With the trained model, users can generate CNN/Daily Mail style highlights from an input article.
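The pipeline the abstract describes (an LSTM encoder and decoder with MLP attention, trained by back propagation on a loss computed between the generated and reference summaries) can be sketched roughly as below. This is a minimal illustration only, not the project's code (which is in DocSummarization.zip); the class names, toy dimensions, and the single teacher-forced training step are assumptions made for the example.

```python
# Minimal PyTorch sketch of the seq2seq summarizer described above.
# All sizes and names are hypothetical; the project's real code is in DocSummarization.zip.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, EMB_DIM, HID_DIM = 5000, 128, 256  # assumed toy dimensions

class Encoder(nn.Module):
    """LSTM encoder: maps an indexed article to per-token hidden states."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.lstm = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len) token indices
        outputs, state = self.lstm(self.embed(src))
        return outputs, state                      # outputs: (batch, src_len, HID_DIM)

class MLPAttention(nn.Module):
    """Additive (MLP) attention over the encoder outputs."""
    def __init__(self):
        super().__init__()
        self.w = nn.Linear(2 * HID_DIM, HID_DIM)
        self.v = nn.Linear(HID_DIM, 1, bias=False)

    def forward(self, dec_h, enc_out):             # dec_h: (batch, HID_DIM)
        dec_h = dec_h.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        scores = self.v(torch.tanh(self.w(torch.cat([dec_h, enc_out], dim=-1))))
        weights = F.softmax(scores, dim=1)          # attention weights over source tokens
        return (weights * enc_out).sum(dim=1)       # context vector: (batch, HID_DIM)

class Decoder(nn.Module):
    """LSTM decoder that attends to the encoder outputs at every step."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.attn = MLPAttention()
        self.lstm = nn.LSTM(EMB_DIM + HID_DIM, HID_DIM, batch_first=True)
        self.out = nn.Linear(HID_DIM, VOCAB_SIZE)

    def forward(self, tok, state, enc_out):         # tok: (batch, 1) previous token
        context = self.attn(state[0][-1], enc_out)
        x = torch.cat([self.embed(tok), context.unsqueeze(1)], dim=-1)
        out, state = self.lstm(x, state)
        return self.out(out.squeeze(1)), state      # logits over the vocabulary

def train_step(encoder, decoder, optimizer, src, tgt):
    """One illustrative iteration: generate with teacher forcing, compute the loss,
    and back-propagate to improve the model."""
    optimizer.zero_grad()
    enc_out, state = encoder(src)
    loss = 0.0
    for t in range(tgt.size(1) - 1):
        logits, state = decoder(tgt[:, t:t + 1], state, enc_out)
        loss = loss + F.cross_entropy(logits, tgt[:, t + 1])
    loss.backward()                                  # back propagation on the loss value
    optimizer.step()
    return loss.item() / (tgt.size(1) - 1)
```

At test time the decoder would instead be fed its own previous prediction at each step, and the generated token sequence would be scored against the human-written highlights with PyRouge, as the abstract describes.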
dc.description.notes: DocSummarization.zip contains all source code for the project, the training data set, and a trained model. DocSummarizationReport (PDF and DOCX versions) describes the project design and all technical details, and includes a user manual and a developer manual. DocSummarizationPresentation (PDF and PPTX versions) contains the slides used for the final presentation of the project and shows its general design and phases.
dc.identifier.uri: http://hdl.handle.net/10919/83197
dc.language.iso: en_US
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Deep learning (Machine learning)
dc.subject: Natural Language Processing
dc.subject: Text Summarization
dc.subject: Recurrent Neural Network
dc.subject: Sequence to sequence
dc.title: Neural Network Doc Summarization
dc.type: Dataset
dc.type: Presentation
dc.type: Report
dc.type: Software

Files

Original bundle (5 files):
  DocSummarization.zip (160.86 MB)
  DocSummarizationPresentation.pdf (894.29 KB, Adobe Portable Document Format)
  DocSummarizationPresentation.pptx (735.72 KB, Microsoft Powerpoint XML)
  DocSummarizationReport.pdf (3.4 MB, Adobe Portable Document Format)
  DocSummarizationReport.docx (3.99 MB, Microsoft Word XML)

License bundle (1 file):
  license.txt (1.5 KB, item-specific license agreed upon to submission)