CS4984/CS5984: Big Data Text Summarization Team 10 ETDs

dc.contributor.author | Baghudana, Ashish | en
dc.contributor.author | Li, Guangchen | en
dc.contributor.author | Liu, Beichen | en
dc.contributor.author | Lasky, Stephen | en
dc.date.accessioned | 2018-12-15T18:38:06Z | en
dc.date.available | 2018-12-15T18:38:06Z | en
dc.date.issued | 2018-12-14 | en
dc.description.abstract | Automatic text summarization is the task of creating accurate and succinct summaries of text documents. These documents can vary from newspaper articles to more academic content such as theses and dissertations. The two domains differ significantly in sentence structure and vocabulary, as well as in document length, with theses and dissertations being more verbose and using a very specialized vocabulary. Summarization techniques are broadly classified into extractive and abstractive styles: in the former, salient sentences are extracted from the text without any modification; in the latter, sentences are modified and paraphrased. Recent developments in neural networks, language modeling, and machine translation have spurred research into abstractive text summarization. Recently developed models are generally trained on news articles, specifically CNN and DailyMail, for which summaries are readily available through public datasets. In this project, we apply recent deep-learning techniques of text summarization to produce summaries of electronic theses and dissertations from VTechWorks, Virginia Tech's online repository of scholarly work. We overcome the challenge posed by different vocabularies by creating a dataset of pre-print articles from ArXiv and training summarization models on these documents. The ArXiv collection consists of approximately 4500 articles, each of which has an abstract and the corresponding full text. For the purposes of training summarization models, we consider the abstract as the summary of the document. We split this dataset into train, test, and validation sets of 3155, 707, and 680 documents respectively. We also prepare gold standard summaries from chapters of electronic theses and dissertations. Subsequently, we train pointer generator networks on the ArXiv dataset and evaluate the trained models using ROUGE scores. The ROUGE scores are reported on both the test split of the ArXiv dataset and the gold standard summaries. While the ROUGE scores do not indicate state-of-the-art performance, we do not find any equivalent work in summarization of academic content to compare against. | en
dc.description.notes | The submission contains multiple files: CS5984_Final_Presentation.pdf (the PDF version of the presentation); CS5984_Final_Presentation.zip (the LaTeX source code for the presentation); CS5984_Final_Report.pdf (the PDF version of the report); CS5984_Final_Report.zip (the LaTeX source code for the report); source_code.zip (contains three zip files: pointer_summarizer.zip, pointer_generator_data.zip, and pdf-data-extraction.zip; these contain the source code for training PGNs, generating data for PGNs, and using ScienceParse/Grobid to extract text from PDFs). | en
dc.description.sponsorship | NSF: IIS-1619028 | en
dc.identifier.uri | http://hdl.handle.net/10919/86418 | en
dc.language.iso | en_US | en
dc.publisher | Virginia Tech | en
dc.rights | Creative Commons CC0 1.0 Universal Public Domain Dedication | en
dc.rights.uri | http://creativecommons.org/publicdomain/zero/1.0/ | en
dc.subject | text summarization | en
dc.subject | Machine learning | en
dc.subject | deep learning | en
dc.subject | abstractive text summarization | en
dc.subject | arxiv | en
dc.subject | rouge | en
dc.subject | pointer generator networks | en
dc.subject | seq2seq | en
dc.title | CS4984/CS5984: Big Data Text Summarization Team 10 ETDs | en
dc.type | Presentation | en
dc.type | Report | en
dc.type | Software | en
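
The abstract reports that the trained models were evaluated with ROUGE scores. As a rough illustration of what such a score measures, the following is a minimal sketch of ROUGE-1 (unigram overlap) between a generated summary and a reference summary. It is not the project's evaluation code: the function name `rouge_1` is hypothetical, and lowercased whitespace tokenization is a simplifying assumption, so its numbers will not match those of a standard ROUGE toolkit.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Illustrative ROUGE-1: unigram precision, recall, and F1.

    Assumption: tokens are lowercased and split on whitespace, with no
    stemming or stopword removal.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge_1("the cat sat", "the cat sat on the mat")
    print(scores)
```

In this example every candidate word appears in the reference, so precision is high, while recall is lower because the reference contains words the candidate omits.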

Files

Original bundle
Name: CS5984_Final_Presentation.zip | Size: 45.37 KB
Name: source_code.zip | Size: 3.09 MB
Name: CS5984_Final_Presentation.pdf | Size: 63.64 KB | Format: Adobe Portable Document Format
Name: CS5984_Final_Report.zip | Size: 462.83 KB
Name: CS5984_Final_Report.pdf | Size: 794.98 KB | Format: Adobe Portable Document Format
License bundle
Name: license.txt | Size: 1.5 KB | Format: Item-specific license agreed upon to submission