Summarizing Fire Events with Natural Language Processing

Plahn, Jordan; Zamani, Michael; Lee, Hayden; Trujillo, Michael

dc.contributor.author	Plahn, Jordan	en
dc.contributor.author	Zamani, Michael	en
dc.contributor.author	Lee, Hayden	en
dc.contributor.author	Trujillo, Michael	en
dc.date.accessioned	2014-12-13T19:18:37Z	en
dc.date.available	2014-12-13T19:18:37Z	en
dc.date.issued	2014-12	en
dc.description	There are both PowerPoint and PDF file versions of the final presentation. Also, there are both Word and PDF file versions of the final report.	en
dc.description.abstract	Throughout this semester, we were driven by one question: how do we best summarize a fire with articles scraped from the internet? We took a variety of approaches to answer it, incrementally constructing a solution to summarize our events in a satisfactory manner. We needed a considerable amount of data to process. This data came in the form of two separate corpora: one involving the Bastrop County, Texas wildfires of 2011 and the other the Kiss nightclub fire of 2013 in Santa Maria, Brazil. For our “small” collection, the Texas wildfires, we had approximately 16,000 text files. For our “large” collection, the nightclub fire, we had approximately 690,000 text files. Theoretically, each text file contained a single news article relating to the event. In reality, this was rarely true. As a result, we had to perform considerable preprocessing of our corpora to ensure useful outcomes. The incremental steps to produce our final summary took the form of nine units to be completed over the course of the semester, with each building on the work of the previous unit. Owing to our lack of domain knowledge at the beginning of the semester (of either fires or natural language processing), we were provided considerable guidance to produce naive, albeit useful, initial solutions. In the first few units, we summarized our collections with brute-force approaches: choosing the most frequent words as descriptors, manually generating words to describe the collection, selecting descriptive lemmas, and more. Most of these approaches are characterized by arbitrarily selecting descriptors based on frequency alone, with little consideration for the underlying linguistic significance. From this, we transitioned to more intelligent approaches, using more fine-grained techniques to remove extraneous information. 
We incorporated part-of-speech (POS) tagging to determine each word's part of speech, which allowed us to select the most important nouns, for example. Using POS tagging, as well as an ever-expanding stopword list, allowed us to remove many of the uninformative results. To further improve our collection, we needed a way to filter out more than just stopwords. In our case, we had many text files that were unrelated to corpus topics, which could corrupt or skew our results. To address this, we built a document classifier to determine whether articles are relevant and mark them appropriately, allowing us to include only the relevant articles in our processing. Despite this, our collection still suffered from considerable noise. In almost all of our units we employed various “big data” techniques and tools, including MapReduce and Mahout. These tools allowed us to process extremely large collections of data in an efficient manner. With these tools we could select the most relevant names, topics, and sentences, providing the framework for a summary of the entire collection. These insights led us to the final two units, in which we produced a summary based on preconstructed templates of our events. Using a mixture of every technique we had learned, we constructed paragraphs that summarized both fires in our collections. For the final two units of our course, we were tasked with creating a paragraph summary of both the Texas Wildfire and the Brazil Nightclub Fire events. We began with a generic fire event template with a set of attributes that would be filled in with the best results we could extract. We made the decision early on to create separate templates for the more specific fire event types of wildfires and building fires, as some details do not overlap between the two event types. To fill in our templates, we created a process of extracting, refining, and finally filling in our gathered results. 
To extract data from our corpora, we created a regular expression for each attribute type and stored any matches found. Next, using only the top 10 results for each attribute, we filtered results by part of speech, constructed a simple grammar to modify the template according to our selected result, and conjugated any present-tense verbs to past tense.	en
dc.description.sponsorship	NSF DUE-1141209 and IIS-1319578	en
dc.identifier.uri	http://hdl.handle.net/10919/51131	en
dc.language.iso	en_US	en
dc.rights	Creative Commons Attribution-NonCommercial 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/us/	en
dc.subject	Natural Language Processing	en
dc.subject	big data	en
dc.subject	MapReduce	en
dc.subject	LDA	en
dc.subject	Mahout	en
dc.subject	NER	en
dc.subject	k-means	en
dc.subject	Linguistics	en
dc.subject	Fire	en
dc.subject	Hadoop	en
dc.title	Summarizing Fire Events with Natural Language Processing	en
dc.type	Presentation	en
dc.type	Technical report	en
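
The extraction and template-filling process described in the abstract can be illustrated with a minimal sketch. It assumes one regular expression per attribute, keeps the 10 most frequent matches, and fills a template with the top candidate. The patterns, the template wording, and the function names are hypothetical and are not taken from the report; the part-of-speech filtering, grammar, and verb conjugation steps are omitted.

```python
import re
from collections import Counter

# Hypothetical attribute patterns; the report's actual expressions
# are not given in the abstract.
PATTERNS = {
    "location": re.compile(
        r"\bfire (?:in|at|near) ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
    ),
    "deaths": re.compile(
        r"\b(\d[\d,]*) (?:people )?(?:were killed|killed|dead|died)\b"
    ),
}

# Hypothetical template for a generic fire event.
TEMPLATE = "A fire occurred in {location}, where {deaths} people were killed."


def extract_candidates(documents, top_n=10):
    """Return the top_n most frequent regex matches for each attribute."""
    counts = {name: Counter() for name in PATTERNS}
    for text in documents:
        for name, pattern in PATTERNS.items():
            # findall returns the captured group for each match.
            counts[name].update(pattern.findall(text))
    return {
        name: [match for match, _ in counter.most_common(top_n)]
        for name, counter in counts.items()
    }


def fill_template(candidates, template=TEMPLATE):
    """Fill each slot with its most frequent candidate, or 'unknown'."""
    best = {
        name: (values[0] if values else "unknown")
        for name, values in candidates.items()
    }
    return template.format(**best)
```

Called on a list of article strings, `extract_candidates` returns a ranked candidate list per attribute, and `fill_template` turns the top candidates into a one-sentence summary. Choosing by raw frequency is the simplest option here; noisy collections would likely need the relevance filtering mentioned in the abstract first.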
