Big Data Text Summarization - Hurricane Irma

Chava, Raja Venkata Satya Phanindra; Dhar, Siddharth; Gaur, Yamini; Rambhakta, Pranavi; Shetty, Sourabh

dc.contributor.author	Chava, Raja Venkata Satya Phanindra	en
dc.contributor.author	Dhar, Siddharth	en
dc.contributor.author	Gaur, Yamini	en
dc.contributor.author	Rambhakta, Pranavi	en
dc.contributor.author	Shetty, Sourabh	en
dc.date.accessioned	2018-12-13T15:22:01Z	en
dc.date.available	2018-12-13T15:22:01Z	en
dc.date.issued	2018-12-13	en
dc.description.abstract	With the increased rate of content generation on the Internet, there is a pressing need for tools that automate the extraction of meaningful data. Big data analytics deals with finding patterns or implicit correlations within a large collection of data. Data can come from several sources, such as news websites, social media platforms (for example, Facebook and Twitter), sensors, and other IoT (Internet of Things) devices. Social media platforms like Twitter are important sources of data because the level of activity increases significantly during major events such as hurricanes, floods, and events of global importance. To generate summaries, we first converted the WARC file we were given into JSON format, which was easier to work with. We then cleaned the text by removing boilerplate and redundant information. After that, we removed stopwords and collected the most important words occurring in the documents. This ensured that the resulting summary would contain important information from our corpus and would still be able to answer all the questions. One of the challenges we faced at this point was deciding how to correlate words in order to get the most relevant words out of a document. We tried several techniques, such as TF-IDF, to resolve this. Correlation of words with each other is an important factor in generating a cohesive summary: a word may not be in the list of most commonly occurring words in the corpus, yet it could still be relevant and give significant information about the event. Because Hurricane Irma occurred around the same time as Hurricane Harvey, a large number of documents in the collection were not about Hurricane Irma. These documents were deemed non-relevant and eliminated.
Classification of documents as relevant or non-relevant ensured that our deep learning summaries were not generated from data that was not crucial to building our final summary. Initially, we attempted to use Mahout classifiers, but the results obtained were not satisfactory. Instead, we used a much simpler word filtering approach for classification, which eliminated a significant number of documents by classifying them as non-relevant. We used the Pointer-Generator technique, which uses a Recurrent Neural Network (RNN), to build the deep learning abstractive summary. We combined data from multiple relevant documents into a single document, and thus generated multiple summaries, each corresponding to a set of documents. We wrote a Python script to post-process the generated summary by converting every alphabetic character that follows a period and a space to uppercase. This was important because the whole dataset is converted to lowercase for lemmatization, stopword removal, and POS tagging. The script also converts the first alphabetic character of all POS-tagged proper nouns to uppercase. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to evaluate the generated summary against the Gold Standard summary. The abstractive summary returns good evaluation results when compared with the Gold Standard on the ROUGE_sent evaluation. The ROUGE_para and cov_entity evaluation results were not up to the mark, but we feel that was mainly due to the writing style of the Gold Standard, as our abstractive summary was able to provide most of the information related to Hurricane Irma.	en
dc.description.notes	Big_Data_Text_Summarization_Report_Hurricane_Irma.pdf - Report in PDF format. Big_Data_Text_Summarization_Report_Hurricane_Irma.zip - Report in zip format. Hurricane Irma Final Presentation.pptx - Final Presentation in PowerPoint (pptx) format. Hurricane Irma Final Presentation.pdf - Final Presentation in PDF format. Code_files.zip - Code files compressed in a single zip file.	en
dc.description.sponsorship	NSF: IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/86372	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons CC0 1.0 Universal Public Domain Dedication	en
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	en
dc.subject	Text Classification	en
dc.subject	Abstractive Summary	en
dc.subject	Apache Spark	en
dc.subject	webpage	en
dc.subject	Deep learning (Machine learning)	en
dc.title	Big Data Text Summarization - Hurricane Irma	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en
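
The word filtering and case-restoring steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the code shipped in Code_files.zip: the function names, the filter term list, and the hit threshold are all hypothetical, and the proper nouns are assumed to arrive as a plain list taken from a POS tagger's output. Capitalizing the very first character of the summary is also an assumption added here.

```python
import re

# Hypothetical filter terms; the record does not list the words actually used.
RELEVANT_TERMS = {"irma"}


def is_relevant(text, terms=RELEVANT_TERMS, min_hits=1):
    """Treat a document as relevant if it mentions enough filter terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for token in tokens if token in terms)
    return hits >= min_hits


def restore_case(summary, proper_nouns=()):
    """Re-capitalize a lowercased summary.

    Uppercases the letter that follows a period and a space (and, as an
    assumption, the first letter of the text), then capitalizes each
    proper noun passed in by the caller.
    """
    text = re.sub(
        r"(^|\. )([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        summary,
    )
    for noun in proper_nouns:
        # Whole-word match so "irma" does not alter longer words.
        text = re.sub(rf"\b{re.escape(noun)}\b", noun.capitalize(), text)
    return text


if __name__ == "__main__":
    docs = [
        "Hurricane Irma made landfall in Florida",
        "Hurricane Harvey flooded Houston",
    ]
    print([is_relevant(d) for d in docs])
    print(restore_case("hurricane irma hit florida. it weakened later.",
                       ["irma", "florida"]))
```

A term list this small would also keep documents that mention Irma only in passing, so a real filter would likely need more terms or a higher threshold.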