Big Data Text Summarization - Hurricane Irma

Abstract

With the increasing rate of content generation on the Internet, there is a pressing need for tools that automate the extraction of meaningful data. Big data analytics deals with finding patterns or implicit correlations within a large collection of data. Data can come from several sources, such as news websites, social media platforms (for example, Facebook and Twitter), sensors, and other IoT (Internet of Things) devices. Social media platforms like Twitter are important sources of data because their level of activity increases significantly during major events such as hurricanes, floods, and events of global importance. To generate summaries, we first converted the WARC file that was given to us into JSON format, which was easier to work with.

We then cleaned the text by removing boilerplate and redundant information. After that, we removed stopwords and collected the most important words occurring in the documents. This ensured that the resulting summary would contain important information from our corpus and would still be able to answer all the questions. One of the challenges we faced at this point was deciding how to correlate words in order to get the most relevant words out of a document. We tried several techniques, such as TF-IDF, to resolve this. Correlation between words is an important factor in generating a cohesive summary because a word that is not among the most commonly occurring words in the corpus can still be relevant and give significant information about the event. Because Hurricane Irma occurred around the same time as Hurricane Harvey, a large number of documents in the collection were not about Hurricane Irma. All such documents were deemed non-relevant and eliminated. Classifying documents as relevant or non-relevant ensured that our deep learning summaries were not generated from data that was not crucial to building our final summary. Initially, we attempted to use Mahout classifiers, but the results were not satisfactory. Instead, we used a much simpler word filtering approach for classification, which eliminated a significant number of documents by classifying them as non-relevant.
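The word filtering idea can be illustrated with a minimal Python sketch. It is not the project's actual filter: the term lists, the `min_hits` threshold, and the function names are all assumptions made for illustration, since the abstract does not state which words or rules were used.

```python
import re

# Hypothetical term lists; the abstract does not give the real filter words.
RELEVANT_TERMS = {"irma"}
OFF_TOPIC_TERMS = {"harvey"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def is_relevant(text, min_hits=2):
    """Keep a document only if it mentions the target event at least
    `min_hits` times and more often than the competing event.
    Both conditions are illustrative assumptions."""
    tokens = tokenize(text)
    on_topic = sum(t in RELEVANT_TERMS for t in tokens)
    off_topic = sum(t in OFF_TOPIC_TERMS for t in tokens)
    return on_topic >= min_hits and on_topic > off_topic


docs = [
    "Hurricane Irma hit Florida. Irma caused widespread flooding.",
    "Hurricane Harvey flooded Houston; Irma came later.",
]
relevant_docs = [d for d in docs if is_relevant(d)]
```

With these assumed settings, the first sample document is kept and the second is dropped, because it mentions Irma only once and is mainly about Harvey.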

We used the Pointer-Generator technique, which uses a Recurrent Neural Network (RNN), to build the deep learning abstractive summary. We combined data from multiple relevant documents into a single document, and thus generated multiple summaries, each corresponding to a set of documents. We wrote a Python script to post-process the generated summary by converting every alphabetic character that follows a period and a space to uppercase. This was important because the whole dataset is converted to lowercase for lemmatization, stopword removal, and POS tagging. The script also converts the first alphabetic character of every POS-tagged proper noun to uppercase. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to evaluate the generated summary against the Gold Standard summary. The abstractive summary returns good results when compared with the Gold Standard on the ROUGE_sent evaluation. The ROUGE_para and cov_entity evaluation results were not up to the mark, but we feel that this was mainly due to the writing style of the Gold Standard, as our abstractive summary was able to provide most of the information related to Hurricane Irma.
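The capitalization post-processing described above could look roughly like the following Python sketch. It is not the project's script: the function names are hypothetical, the set of proper nouns is assumed to come from a POS tagger run beforehand, and capitalizing the very first letter of the summary is an added assumption.

```python
import re


def capitalize_sentences(text):
    """Uppercase the first letter of the text (an assumption) and any
    lowercase letter that follows a period and a space."""
    text = re.sub(r"^(\s*)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)
    return re.sub(r"(\. )([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)


def capitalize_proper_nouns(text, proper_nouns):
    """Uppercase the first letter of every word found in `proper_nouns`,
    a set of lowercase tokens assumed to be tagged as proper nouns."""
    def fix(match):
        word = match.group(0)
        if word.lower() in proper_nouns:
            return word[0].upper() + word[1:]
        return word
    return re.sub(r"[A-Za-z]+", fix, text)


raw = "hurricane irma hit florida. it weakened over georgia."
tagged = {"irma", "florida", "georgia"}  # hypothetical POS tagger output
summary = capitalize_proper_nouns(capitalize_sentences(raw), tagged)
```

Taking the proper nouns as a precomputed set keeps the sketch independent of any particular POS tagging library.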

Keywords
Text Classification, Abstractive Summary, Apache Spark, webpage, Deep learning (Machine learning)