Summarizing Fire Events with Natural Language Processing

Throughout this semester, we were driven by one question: how do we best summarize a fire with articles scraped from the internet? We took a variety of approaches to answer it, incrementally constructing a solution to summarize our events in a satisfactory manner.

We needed a considerable amount of data to process. This data came in the form of two separate corpora: one involving the Bastrop County, Texas wildfires of 2011 and the other the Kiss nightclub fire of 2013 in Santa Maria, Brazil. For our “small” collection, the Texas wildfires, we had approximately 16,000 text files. For our “large” collection, the nightclub fire, we had approximately 690,000 text files. Theoretically, each text file contained a single news article relating to the event. In reality, this was rarely true. As a result, we had to perform considerable preprocessing of our corpora to ensure useful outcomes.

The incremental steps to produce our final summary took the form of nine units completed over the course of the semester, each building on the work of the previous one. Because we began the semester with little domain knowledge of either fires or natural language processing, we were given considerable guidance to produce naive, albeit useful, initial solutions. In the first few units, we summarized our collections with brute-force approaches: choosing the most frequent words as descriptors, manually generating words to describe the collection, selecting descriptive lemmas, and more. Most of these approaches select descriptors based on frequency alone, with little consideration for the underlying linguistic significance.
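
The frequency-based approach described above can be sketched roughly as follows. This is a minimal illustration rather than the code used in the project: the stopword list, the `top_words` function name, and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "was", "were", "on", "by"}

def top_words(documents, n=5):
    """Return the n most frequent non-stopword tokens across all documents."""
    counts = Counter()
    for text in documents:
        # Lowercase and keep alphabetic runs only (a crude tokenizer).
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# Made-up sample sentences standing in for scraped articles.
docs = [
    "The wildfire destroyed homes in Bastrop County.",
    "Firefighters contained the wildfire near Bastrop.",
]
print(top_words(docs, n=2))
```

Ranking by raw counts like this is simple, but it treats every token equally, which is why frequency alone says little about linguistic significance.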

From this, we transitioned to more intelligent approaches, using more fine-grained techniques to remove extraneous information. We incorporated part-of-speech (POS) tagging to determine the part of speech of each word, which allowed us, for example, to select the most important nouns. POS tagging, together with an ever-expanding stopword list, allowed us to remove many of the uninformative results. To further improve our collection, we needed a way to filter out more than just stopwords. In our case, many text files were unrelated to the corpus topics and could corrupt or skew our results. To address this, we built a document classifier that determines whether an article is relevant and marks it accordingly, allowing us to include only the relevant articles in our processing. Despite this, our collection still suffered from considerable noise.
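
As a rough illustration of relevance filtering, the sketch below trains a small multinomial Naive Bayes classifier with add-one smoothing. The choice of Naive Bayes is an assumption made for this example; the abstract does not say which classifier was built. The training sentences, labels, and function names are likewise hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """Collect per-label token counts from (text, label) pairs."""
    word_counts = {}        # label -> Counter of tokens
    doc_counts = Counter()  # label -> number of training documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token in vocab:  # ignore tokens never seen in training
                likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
                score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training examples standing in for hand-labeled articles.
training = [
    ("Wildfire burns homes in Bastrop County", "relevant"),
    ("Firefighters battle the blaze as residents evacuate", "relevant"),
    ("Subscribe to our newsletter for daily deals", "irrelevant"),
    ("Click here to log in to your account", "irrelevant"),
]
model = train(training)
print(classify(model, "Residents evacuate as wildfire spreads"))
print(classify(model, "Log in to your account"))
```

Only documents labeled relevant would then be passed on to later processing steps.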

In almost all of our units, we employed various “big data” techniques and tools, including MapReduce and Mahout. These tools allowed us to process extremely large collections of data efficiently. With them, we could select the most relevant names, topics, and sentences, providing the framework for a summary of the entire collection. These insights led us to the final two units, in which we produced a summary based on preconstructed templates of our events. Using a mixture of every technique we had learned, we constructed paragraphs that summarized both fires in our collections.

For the final two units of our course, we were tasked with creating a paragraph summary of both the Texas Wildfire and the Brazil Nightclub Fire events. We began with a generic fire event template containing a set of attributes to be filled in with the best results we could extract. We decided early on to create separate templates for the more specific event types of wildfires and building fires, as some details do not overlap between the two. To fill in our templates, we followed a process of extracting results, refining them, and finally inserting them. To extract data from our corpora, we created a regular expression for each attribute type and stored any matches found. Next, using only the top 10 results for each attribute, we filtered results by part of speech, constructed a simple grammar to modify the template according to our selected result, and conjugated any present tense verbs to past tense.
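
The extract-and-fill process described above might look something like the following sketch. The regular expressions, attribute names, template wording, and sample sentences are all hypothetical and chosen only for illustration; the part-of-speech filtering, grammar, and verb conjugation steps are omitted.

```python
import re
from collections import Counter

# Hypothetical patterns, one per template attribute.
PATTERNS = {
    "location": re.compile(r"\bin ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    "deaths": re.compile(r"\b(\d+) (?:people )?(?:were killed|died|dead)"),
}

# Hypothetical template with one slot per attribute.
TEMPLATE = "A fire occurred in {location}, where {deaths} people died."

def extract(documents, top_n=10):
    """Count every regex match per attribute and keep the top_n candidates."""
    results = {}
    for name, pattern in PATTERNS.items():
        counts = Counter()
        for text in documents:
            counts.update(pattern.findall(text))
        results[name] = counts.most_common(top_n)
    return results

def fill(template, results):
    """Fill each slot with the most frequent candidate, or 'unknown' if none."""
    values = {
        name: (candidates[0][0] if candidates else "unknown")
        for name, candidates in results.items()
    }
    return template.format(**values)

# Made-up sample sentences standing in for scraped articles.
docs = [
    "A fire broke out in Santa Maria on Sunday. At least 230 people died.",
    "Officials in Santa Maria said 230 people were killed.",
    "The nightclub in Brazil was crowded; 233 dead, police said.",
]
print(fill(TEMPLATE, extract(docs)))
```

Choosing the most frequent match lets agreement across many articles outvote occasional noisy or conflicting values.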
There are both PowerPoint and PDF file versions of the final presentation. Also, there are both Word and PDF file versions of the final report.
Natural Language Processing, big data, MapReduce, LDA, Mahout, NER, k-means, Linguistics, Fire, Hadoop