CS4984/CS5984: Big Data Text Summarization Team 17 ETDs

Khaghani, Farnaz; Marin Thomas, Ashin; Patnayak, Chinmaya; Sharma, Dhruv; Aromando, John

dc.contributor.author	Khaghani, Farnaz	en
dc.contributor.author	Marin Thomas, Ashin	en
dc.contributor.author	Patnayak, Chinmaya	en
dc.contributor.author	Sharma, Dhruv	en
dc.contributor.author	Aromando, John	en
dc.date.accessioned	2018-12-15T20:43:21Z	en
dc.date.available	2018-12-15T20:43:21Z	en
dc.date.issued	2018-12-15	en
dc.description.abstract	Given the current explosion of information across media such as electronic and physical texts, concise and relevant information has become key to understanding. Summarization, the process of reducing a text so that it conveys only its salient aspects, has emerged as a challenging task in Natural Language Processing. Academia generates voluminous amounts of data in the form of theses and dissertations. Obtaining a chapter-wise summary of an electronic thesis or dissertation (ETD) can be computationally expensive, particularly because of its length and the subject to which it pertains. In this course, research and development on various summarization techniques, primarily extractive and abstractive summarization, were analyzed. Deep learning has seen various developments that tackle summarization and produce coherent and meaningful summaries of news articles. In this project, tools that could generate coherent and concise summaries of long ETDs were developed as well. The initial concern was extracting the text from the PDF file of an ETD. GROBID and Scienceparse were used as pre-processing tools for this task; they present the text of a PDF in a structured format such as an XML or JSON file. The outputs of the two tools were compared qualitatively as well as quantitatively. After this, a transfer learning approach was adopted, wherein a pre-trained model was tweaked to fit the task of summarizing each ETD. Making the model learn the nuances of an ETD was a challenge. An iterative approach was used to explore various networks, each trying to improve on the shortcomings of the previous one in its own way.
Existing deep learning models, including Sequence-2-Sequence, Pointer Generator Networks, and a Hybrid Extractive-Abstractive Reinforce-Selecting Sentence Rewriting Network, were used to generate and test summaries. Further tweaks were made to these deep neural networks to account for datasets, in this case ETDs, that are much longer and more varied than those the networks were designed for. The generated summaries were also thoroughly evaluated against gold standards, created during the span of the course, for five dissertations and theses. ROUGE-1, ROUGE-2, and ROUGE-SU4 were used to compare the generated summaries with the gold standards. The average ROUGE scores were 0.1387, 0.1224, and 0.0480, respectively. These low ROUGE scores could be attributed to the varying summary length and to the complexity of summarizing an ETD. The scope for improvement and the underlying reasons for the performance have also been analyzed. The conclusion drawn from the project is that any machine learning task is highly biased by the patterns inherently present in the data on which the model is trained. In the context of summarization, an article can be summarized from different perspectives, and thus quantitative evaluation measures can vary drastically even when the summary is coherent.	en
dc.description.notes	The submission contains multiple files: CS5984_Final_Presentation.pdf: the PDF version of the presentation; CS5984_Final_Presentation.pptx: the PowerPoint for the presentation; CS5984_Final_Report.pdf: the PDF version of the report; CS5984_Final_Report.zip: the LaTeX source code for the report; Arxiv_finished_files.zip: processed and tokenized arXiv data for the Pointer Generator Network; text-summarization-tensorflow-master.zip: seq2seq model code in TensorFlow, modified to work with the arXiv dataset.	en
dc.description.sponsorship	NSF: IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/86420	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Deep learning	en
dc.subject	Abstractive summarization	en
dc.subject	Electronic Thesis and Dissertation	en
dc.subject	Reinforcement learning	en
dc.subject	ROUGE	en
dc.subject	Pointer Generator Networks	en
dc.subject	seq2seq	en
dc.subject	Machine learning	en
dc.title	CS4984/CS5984: Big Data Text Summarization Team 17 ETDs	en
dc.type	Dataset	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en
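
The abstract reports ROUGE-1 and ROUGE-2 scores for generated summaries against gold-standard summaries. As a rough illustration of what those measures count, the following is a minimal sketch of ROUGE-N in Python. It is not the evaluation code used in the project: the function names `ngrams` and `rouge_n` and the example sentences are hypothetical, tokenization is plain lowercased whitespace splitting, and no stemming or stop-word handling is applied, so its numbers would not match those of a standard ROUGE toolkit. ROUGE-SU4 (skip-bigrams plus unigrams) is not covered.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N recall, precision, and F1 for two plain-text strings.

    Simplified on purpose: lowercased whitespace tokenization, no stemming.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1}


# Hypothetical gold-standard and generated summaries, for illustration only.
gold = "the model summarizes each chapter of the thesis"
generated = "the model summarizes the thesis"

scores_1 = rouge_n(generated, gold, n=1)  # unigram overlap (ROUGE-1)
scores_2 = rouge_n(generated, gold, n=2)  # bigram overlap (ROUGE-2)
print(scores_1)
print(scores_2)
```

In this toy example the generated summary is much shorter than the gold standard, so recall drops while precision stays high, which is one way a difference in summary length can lower a ROUGE score.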

Files

Original bundle

Now showing 1 - 5 of 6

Name: Arxiv_finished_files.zip
Size: 1.72 MB

Name: text-summarization-tensorflow-master.zip
Size: 465.52 KB

Name: CS5984_Final_Presentation.pdf
Size: 984.98 KB
Format: Adobe Portable Document Format

Name: CS5984_Final_Presentation.pptx
Size: 1.57 MB
Format: Microsoft Powerpoint XML

Name: CS5984_Final_Report.pdf
Size: 6.97 MB
Format: Adobe Portable Document Format

License bundle

Now showing 1 - 1 of 1

Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission

Collections

CS4984: Special Topics