
Big Data Text Summarization: Using Deep Learning to Summarize Theses and Dissertations

dc.contributor.author: Ahuja, Naman
dc.contributor.author: Bansal, Ritesh
dc.contributor.author: Ingram, William A.
dc.contributor.author: Jude, Palakh
dc.contributor.author: Kahu, Sampanna
dc.contributor.author: Wang, Xinyue
dc.date.accessioned: 2018-12-14T23:06:28Z
dc.date.available: 2018-12-14T23:06:28Z
dc.date.issued: 2018-12-05
dc.description.abstract: Team 16 in the fall 2018 course "CS 4984/5984 Big Data Text Summarization," in partnership with the University Libraries and the Digital Library Research Laboratory, prepared a corpus of electronic theses and dissertations (ETDs) for students to study natural language processing with the power of state-of-the-art deep learning technology. The ETD corpus is made up of 13,071 doctoral dissertations and 17,890 master's theses downloaded from the University Libraries’ VTechWorks system. This study is designed to explore big data summarization for ETDs, which is a relatively under-explored area. The results of the project will help to address the difficulty of information extraction from ETD documents, the potential of transfer learning for automatic summarization of ETD chapters, and the quality of state-of-the-art deep learning summarization technologies when applied to the ETD corpus. The goal of this project is to generate chapter-level abstractive summaries for an ETD collection through deep learning. Major challenges of the project include accurately extracting well-formatted chapter text from PDF files and the lack of labeled data for supervised deep learning models. For PDF processing, we compare two state-of-the-art scholarly PDF data extraction tools, Grobid and Science-Parse, which generate structured documents from which we can further extract metadata and chapter-level text. For the second challenge, we perform transfer learning by training supervised learning models on a labeled dataset of Wikipedia articles related to the ETD collection. Our experimental models include Sequence-to-Sequence and Pointer Generator summarization models. Besides supervised models, we also experiment with an unsupervised reinforcement model, Fast Abstractive Summarization-RL.
The general pipeline for our experiments consists of the following steps: PDF data processing and chapter extraction, collecting a training data set of Wikipedia articles, manually creating human-generated gold standard summaries for testing and validation, building deep learning models for chapter summarization, evaluating and tuning the models based on results, and then iteratively refining the whole process.
dc.description.notes: Contents:
  ETDSummarizationReport.pdf - primary report document
  ETDSummarizationReport.zip - LaTeX project used to generate the report
  ETDSummarizationPresentation.pdf - final presentation PDF
  ETDSummarizationPresentation.pptx - final presentation as PowerPoint
  ETDSummarizationSourceCode.tar.gz - source code archive
dc.identifier.uri: http://hdl.handle.net/10919/86406
dc.language.iso: en_US
dc.publisher: Virginia Tech
dc.rights: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/
dc.subject: natural language processing
dc.subject: computational linguistics
dc.subject: text summarization
dc.subject: text mining
dc.subject: unstructured data
dc.subject: unstructured data analytics
dc.subject: electronic theses and dissertations
dc.subject: ETDs
dc.title: Big Data Text Summarization: Using Deep Learning to Summarize Theses and Dissertations
dc.type: Presentation
dc.type: Report
dc.type: Software

Files

Original bundle
Name: ETDSummarizationSourceCode.tar.gz
Size: 206.68 KB
Format: Unknown data format

Name: ETDSummarizationReport.pdf
Size: 1.56 MB
Format: Adobe Portable Document Format

Name: ETDSummarizationReport.zip
Size: 813.25 KB
Format:

Name: ETDSummarizationPresentation.pptx
Size: 873.24 KB
Format: Microsoft Powerpoint XML

Name: ETDSummarizationPresentation.pdf
Size: 5.08 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission