CS4984/CS5984: Big Data Text Summarization Team 10 ETDs

Baghudana, Ashish; Li, Guangchen; Liu, Beichen; Lasky, Stephen

dc.contributor.author	Baghudana, Ashish	en
dc.contributor.author	Li, Guangchen	en
dc.contributor.author	Liu, Beichen	en
dc.contributor.author	Lasky, Stephen	en
dc.date.accessioned	2018-12-15T18:38:06Z	en
dc.date.available	2018-12-15T18:38:06Z	en
dc.date.issued	2018-12-14	en
dc.description.abstract	Automatic text summarization is the task of creating accurate and succinct summaries of text documents. These documents can vary from newspaper articles to more academic content such as theses and dissertations. The two domains differ significantly in sentence structure and vocabulary, as well as in document length: theses and dissertations are more verbose and use a highly specialized vocabulary. Summarization techniques are broadly classified as extractive or abstractive. In the former, salient sentences are extracted from the text without modification; in the latter, sentences are modified and paraphrased. Recent developments in neural networks, language modeling, and machine translation have spurred research into abstractive text summarization. Recently developed models are generally trained on news articles, specifically CNN and DailyMail, both of which have summaries readily available through public datasets. In this project, we apply recent deep-learning techniques for text summarization to produce summaries of electronic theses and dissertations from VTechWorks, Virginia Tech's online repository of scholarly work. We overcome the challenge posed by different vocabularies by creating a dataset of pre-print articles from ArXiv and training summarization models on these documents. The ArXiv collection consists of approximately 4500 articles, each of which has an abstract and the corresponding full text. For the purposes of training summarization models, we consider the abstract as the summary of the document. We split this dataset into training, test, and validation sets of 3155, 707, and 680 documents, respectively. We also prepare gold standard summaries from chapters of electronic theses and dissertations. Subsequently, we train pointer generator networks on the ArXiv dataset and evaluate the trained models using ROUGE scores. 
The ROUGE scores are reported both on the test split of the ArXiv dataset and on the gold standard summaries. Although the ROUGE scores do not indicate state-of-the-art performance, we did not find any equivalent work on summarization of academic content to compare against.	en
dc.description.notes	The submission contains multiple files: - CS5984_Final_Presentation.pdf: The PDF version of the presentation. - CS5984_Final_Presentation.zip: The LaTeX source code for the presentation. - CS5984_Final_Report.pdf: The PDF version of the report. - CS5984_Final_Report.zip: The LaTeX source code for the report. - source_code.zip: The zip file contains three zip files - pointer_summarizer.zip, pointer_generator_data.zip, pdf-data-extraction.zip. These contain the source code for training PGNs, generating data for PGNs, and using ScienceParse/Grobid to extract text from PDFs.	en
dc.description.sponsorship	NSF: IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/86418	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons CC0 1.0 Universal Public Domain Dedication	en
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	en
dc.subject	text summarization	en
dc.subject	Machine learning	en
dc.subject	deep learning	en
dc.subject	abstractive text summarization	en
dc.subject	arxiv	en
dc.subject	rouge	en
dc.subject	pointer generator networks	en
dc.subject	seq2seq	en
dc.title	CS4984/CS5984: Big Data Text Summarization Team 10 ETDs	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en