CS4984/CS5984: Big Data Text Summarization Team 10 ETDs
Automatic text summarization is the task of creating accurate and succinct summaries of text documents. These documents can range from newspaper articles to more academic content such as theses and dissertations. The two domains differ significantly in sentence structure, vocabulary, and document length, with theses and dissertations being more verbose and using a highly specialized vocabulary. Summarization techniques are broadly classified as extractive or abstractive: in the former, salient sentences are extracted from the text without any modification; in the latter, sentences are modified and paraphrased. Recent developments in neural networks, language modeling, and machine translation have spurred research into abstractive text summarization. Recently developed models are generally trained on news articles, specifically CNN and DailyMail, for which summaries are readily available through public datasets. In this project, we apply recent deep-learning techniques for text summarization to produce summaries of electronic theses and dissertations from VTechWorks, Virginia Tech's online repository of scholarly work. We overcome the challenge posed by different vocabularies by creating a dataset of pre-print articles from ArXiv and training summarization models on these documents. The ArXiv collection consists of approximately 4500 articles, each of which has an abstract and the corresponding full text. For the purposes of training summarization models, we consider the abstract as the summary of the document. We split this dataset into training, test, and validation sets of 3155, 707, and 680 documents, respectively. We also prepare gold standard summaries from chapters of electronic theses and dissertations. Subsequently, we train pointer generator networks on the ArXiv dataset and evaluate the trained models using ROUGE scores.
The ROUGE scores are reported both on the test split of the ArXiv dataset and on the gold standard summaries. While the ROUGE scores do not indicate state-of-the-art performance, we did not find any equivalent work on summarizing academic content to compare against.
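To make the evaluation metric concrete, the following is a minimal sketch of how a ROUGE-N score can be computed between a generated summary and a reference summary such as an abstract. It is not the evaluation script used in this project: the function names `ngrams` and `rouge_n` are hypothetical, and the sketch assumes lowercasing and whitespace tokenization with no stemming or stopword handling, which is simpler than standard ROUGE toolkits.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return ROUGE-N precision, recall, and F1 for two strings.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)

    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    gold = "the cat lay on the mat"
    print(rouge_n(generated, gold, n=1))
    print(rouge_n(generated, gold, n=2))
```

In a setup like the one described above, `candidate` would be a model-generated summary and `reference` would be either an ArXiv abstract from the test split or a gold standard chapter summary; scores would then be averaged over all document pairs.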