Big Data Text Summarization: Using Deep Learning to Summarize Theses and Dissertations
dc.contributor.author | Ahuja, Naman | en |
dc.contributor.author | Bansal, Ritesh | en |
dc.contributor.author | Ingram, William A. | en |
dc.contributor.author | Jude, Palakh | en |
dc.contributor.author | Kahu, Sampanna | en |
dc.contributor.author | Wang, Xinyue | en |
dc.date.accessioned | 2018-12-14T23:06:28Z | en |
dc.date.available | 2018-12-14T23:06:28Z | en |
dc.date.issued | 2018-12-05 | en |
dc.description.abstract | Team 16 in the fall 2018 course "CS 4984/5984 Big Data Text Summarization," in partnership with the University Libraries and the Digital Library Research Laboratory, prepared a corpus of electronic theses and dissertations (ETDs) for students to study natural language processing using state-of-the-art deep learning technology. The ETD corpus is made up of 13,071 doctoral dissertations and 17,890 master’s theses downloaded from the University Libraries’ VTechWorks system. This study explores big data summarization for ETDs, a relatively under-explored area. The results of the project will shed light on the difficulty of information extraction from ETD documents, the potential of transfer learning for automatic summarization of ETD chapters, and the quality of state-of-the-art deep learning summarization technologies when applied to the ETD corpus. The goal of this project is to generate chapter-level abstractive summaries for an ETD collection through deep learning. The major challenges of the project are accurately extracting well-formatted chapter text from PDF files and the lack of labeled data for supervised deep learning models. For PDF processing, we compare two state-of-the-art scholarly PDF data extraction tools, Grobid and Science-Parse, which generate structured documents from which we can further extract metadata and chapter-level text. For the second challenge, we perform transfer learning by training supervised learning models on a labeled dataset of Wikipedia articles related to the ETD collection. Our experimental models include Sequence-to-Sequence and Pointer-Generator summarization models. Besides supervised models, we also experiment with an unsupervised reinforcement model, Fast Abstractive Summarization-RL. The general pipeline for our experiments consists of the following steps: PDF data processing and chapter extraction; collecting a training dataset of Wikipedia articles; manually creating human-generated gold-standard summaries for testing and validation; building deep learning models for chapter summarization; evaluating and tuning the models based on the results; and iteratively refining the whole process. | en |
dc.description.notes | Contents: ETDSummarizationReport.pdf - primary report document; ETDSummarizationReport.zip - LaTeX project used to generate the report; ETDSummarizationPresentation.pdf - final presentation (PDF); ETDSummarizationPresentation.pptx - final presentation (PowerPoint); ETDSummarizationSourceCode.tar.gz - source code archive | en |
dc.identifier.uri | http://hdl.handle.net/10919/86406 | en |
dc.language.iso | en_US | en |
dc.publisher | Virginia Tech | en |
dc.rights | Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States | en |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/3.0/us/ | en |
dc.subject | natural language processing | en |
dc.subject | computational linguistics | en |
dc.subject | text summarization | en |
dc.subject | text mining | en |
dc.subject | unstructured data | en |
dc.subject | unstructured data analytics | en |
dc.subject | electronic theses and dissertations | en |
dc.subject | ETDs | en |
dc.title | Big Data Text Summarization: Using Deep Learning to Summarize Theses and Dissertations | en |
dc.type | Presentation | en |
dc.type | Report | en |
dc.type | Software | en |
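The abstract above describes a PDF-processing step in which Grobid converts each ETD into a structured document before metadata and chapter-level text are extracted. Below is a minimal, illustrative sketch of that step, assuming a Grobid server running locally on its default port; the file names and the pdf_to_tei helper are hypothetical and are not drawn from the accompanying source code archive.

import requests

# Grobid's full-text extraction endpoint (default local installation; assumed setup).
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def pdf_to_tei(pdf_path, tei_path):
    # POST a single PDF to Grobid and write the returned TEI XML to disk.
    with open(pdf_path, "rb") as pdf_file:
        response = requests.post(GROBID_URL, files={"input": pdf_file}, timeout=300)
    response.raise_for_status()
    with open(tei_path, "w", encoding="utf-8") as tei_file:
        tei_file.write(response.text)

if __name__ == "__main__":
    # Hypothetical file names; chapter-level text would subsequently be pulled
    # from the <div> elements in the TEI body.
    pdf_to_tei("etd_example.pdf", "etd_example.tei.xml")

Posting documents one at a time keeps the sketch simple; in practice, the roughly 30,000 ETDs described in the abstract would be processed in batches against the same endpoint.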
Files
Original bundle (5 files):
- ETDSummarizationPresentation.pdf (5.08 MB, Adobe Portable Document Format)
License bundle (1 file):
- license.txt (1.5 KB, item-specific license agreed to upon submission)