CS4984/CS5984: Big Data Text Summarization Team 17 ETDs

dc.contributor.author: Khaghani, Farnaz
dc.contributor.author: Marin Thomas, Ashin
dc.contributor.author: Patnayak, Chinmaya
dc.contributor.author: Sharma, Dhruv
dc.contributor.author: Aromando, John
dc.date.accessioned: 2018-12-15T20:43:21Z
dc.date.available: 2018-12-15T20:43:21Z
dc.date.issued: 2018-12-15
dc.description.abstract: Given the current explosion of information across media such as electronic and physical texts, concise and relevant data has become key to understanding. Summarization, the process of reducing a text to convey only its salient aspects, has emerged as a challenging task in the field of Natural Language Processing. Academia, in particular, has been generating voluminous amounts of data in the form of theses and dissertations. Obtaining a chapter-wise summary of an electronic thesis or dissertation can be computationally expensive, particularly because of its length and the subject to which it pertains. In this course, research on and development of various summarization techniques, primarily extractive and abstractive summarization, were analyzed. There have been various developments in deep learning to tackle summarization problems and produce coherent and meaningful summaries of news articles. In this project, tools that could be used to generate coherent and concise summaries of long electronic theses and dissertations (ETDs) were developed as well. The initial concern was extracting the text from the PDF file of an ETD. GROBID and Scienceparse were used as pre-processing tools for this task; they present the text of a PDF in a structured format such as an XML or JSON file. The outputs of the two tools were compared qualitatively as well as quantitatively. After this, a transfer learning approach was adopted, wherein a pre-trained model was tweaked to fit the task of summarizing each ETD. Making the model learn the nuances of an ETD proved to be a challenge. An iterative approach was used to explore various networks, each trying to address the shortcomings of the previous one in its own way.
Existing deep learning models, including Sequence-2-Sequence, Pointer Generator Networks, and a Hybrid Extractive-Abstractive Reinforce-Selecting Sentence Rewriting Network, were used to generate and test summaries. Further tweaks were made to these deep neural networks to account for datasets, in this case ETDs, that are much longer and more varied than those the networks were designed for. The generated summaries were also evaluated thoroughly against gold standards for five dissertations and theses created during the course. ROUGE-1, ROUGE-2, and ROUGE-SU4 were used to compare the generated summaries with the gold standards. The average ROUGE scores were 0.1387, 0.1224, and 0.0480, respectively. These low ROUGE scores could be attributed to the varying summary length and to the complexity of the task of summarizing an ETD. The scope for improvement and the underlying reasons for the performance have also been analyzed. The conclusion that can be drawn from the project is that any machine learning task is highly biased by the patterns inherently present in the data on which the model is trained. In the context of summarization, an article can be summarized from different perspectives, and thus quantitative evaluation measures can vary drastically even when the summary is coherent.
dc.description.notes: The submission contains multiple files:
- CS5984_Final_Presentation.pdf: The PDF version of the presentation.
- CS5984_Final_Presentation.ppt: The PowerPoint for the presentation.
- CS5984_Final_Report.pdf: The PDF version of the report.
- CS5984_Final_Report.zip: The LaTeX source code for the report.
- ArXiv finished file: processed and tokenized arXiv data for Pointer Generator Network
- text-summarization-tensorflow: seq2seq model code in TensorFlow, modified to work with the arXiv dataset
dc.description.sponsorship: NSF: IIS-1619028
dc.identifier.uri: http://hdl.handle.net/10919/86420
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Deep learning
dc.subject: Abstractive summarization
dc.subject: Electronic Thesis and Dissertation
dc.subject: Reinforcement learning
dc.subject: ROUGE
dc.subject: Pointer Generator Networks
dc.subject: seq2seq
dc.subject: Machine learning
dc.title: CS4984/CS5984: Big Data Text Summarization Team 17 ETDs
dc.type: Dataset
dc.type: Presentation
dc.type: Report
dc.type: Software

Files

Original bundle
Now showing 1 - 5 of 6
Name: Arxiv_finished_files.zip
Size: 1.72 MB

Name: text-summarization-tensorflow-master.zip
Size: 465.52 KB

Name: CS5984_Final_Presentation.pdf
Size: 984.98 KB
Format: Adobe Portable Document Format

Name: CS5984_Final_Presentation.pptx
Size: 1.57 MB
Format: Microsoft Powerpoint XML

Name: CS5984_Final_Report.pdf
Size: 6.97 MB
Format: Adobe Portable Document Format

License bundle
Now showing 1 - 1 of 1
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission