Hurricane Matthew Summarization

Abstract

This submission contains the report, presentation, and code for our project in the course CS 4984/5984: Big Data Text Summarization. Our team explored methods of text summarization for two datasets and reported on our findings. The report begins with how we cleaned the data and filtered out unnecessary documents. It then describes simple tasks, such as counting the most common and most important words and counting words by part of speech, followed by intermediate tasks, such as clustering and finding LDA topics. Finally, it presents our best summarization methods: template summarization and extractive summarization. For each attempt, we describe the algorithm, our motivation, and the conclusions we drew. The report also contains a user guide and a developer guide for using and maintaining our code, along with a description of the tools and libraries we used. It closes with the Gold Standard Summary that we manually wrote for another team in the course to use as a comparison for their automatically generated summary. We evaluated our own automatically generated summaries against a gold standard prepared by Team 2 and found that our extractive summary performed best according to its ROUGE scores.

The source code zip file contains the code used for the tasks described in the report. The code is written in Python and can be run after installing the dependencies listed in the User Manual section of the report. The presentation file contains the slides from our final presentation, which cover much of the information in the report in a greatly simplified form. Editable versions of the LaTeX document used to create the final report and of the PPTX file from the final presentation are also included.
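
To make the ROUGE comparison mentioned in the abstract concrete, here is a minimal Python sketch of ROUGE-1, which scores a generated summary by its unigram overlap with a gold standard summary. It is an illustration only: the abstract does not say which ROUGE variants or which tool the team used, and the function names tokenize and rouge_1, the simple regex tokenizer, and the example sentences are all assumptions made for this sketch.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_1(candidate, reference):
    """Return ROUGE-1 precision, recall, and F1 for a candidate against one reference."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))

    # Clipped overlap: each unigram counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the project's data.
    generated = "Hurricane Matthew hit Haiti"
    gold = "Hurricane Matthew struck Haiti in October"
    print(rouge_1(generated, gold))
```

Recall measures how much of the gold standard the generated summary covers, while precision penalizes generated text that the gold standard does not contain; a comparison between template and extractive summaries would typically report these per method.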

Keywords
Text Summarization, Big Data, Hurricane Matthew, LDA Topic Modelling, POS Tagging, Template Summary, Extractive Summary