Generating an Intelligent Human-Readable Summary of a Shooting Event from a Large Collection of Webpages

Abstract

We describe our approach to generating summaries of a shooting event from a large collection of webpages. We work with two separate events: a shooting at a school in Newtown, Connecticut, and another at a mall in Tucson, Arizona. Our corpora of webpages are inherently noisy and contain a large amount of irrelevant information. We first clean each webpage collection by removing irrelevant content. To do so, we use natural language processing techniques such as word frequency analysis, part-of-speech tagging, and named entity recognition to identify key words about each news event. Using these key words as features, we employ classification techniques to categorize each document as relevant or irrelevant, and we discard the documents classified as irrelevant. To generate a summary, we need specific information that answers important questions such as "Who was the killer?", "Where did the shooting happen?", and "How many casualties were there?" To extract these essential details from news articles, we design a template of the event summary with slots for the information we would like to extract. We design regular expressions to identify a number of candidate values for the template slots. Using a combination of word frequency analysis and specific validation techniques, we choose the top candidate for each slot of our template. We then use a grammar based on our template to generate a human-readable summary of each event. We use the Hadoop MapReduce framework to parallelize our workflow, along with the NLTK language processing library to simplify and speed up our development. We learned that a variety of methods and techniques are necessary to produce an accurate summary for any collection. Cleaning proved to be a very difficult yet necessary task when attempting to semantically interpret the data.
We found that our attempts to extract relevant topics and sentences using Latent Dirichlet Allocation (a topic extraction method) and k-means clustering did not yield topics and sentences that were indicative of our corpus. We demonstrate an effective way of summarizing a shooting event that extracts relevant information using regular expressions and generates a comprehensive human-readable summary using a regular grammar. Our solution generates a summary that includes the key information needed to understand a shooting event: the shooter(s), the date of the shooting, the location of the shooting, the number of people injured and wounded, and the weapon used. This solution is shown to work effectively for two different types of shootings: a mass murder and an assassination attempt.
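The slot-filling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the project's code: the three patterns, the function names, and the sample sentences (including the names "John Doe", "Springfield", and "Shelbyville") are all invented for illustration, and the project's actual regular expressions live in mapper.py. The sketch collects candidate values per slot with regular expressions, keeps the most frequent candidate, and fills a fixed sentence template. The validation step mentioned above is omitted.

```python
import re
from collections import Counter

# Hypothetical slot patterns, invented for this sketch; each has one capture group.
SLOT_PATTERNS = {
    "shooter": re.compile(r"\b(?:gunman|shooter|suspect),?\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)"),
    "killed": re.compile(r"\b(\d+)\s+(?:people|children|victims)\s+(?:were\s+)?killed\b"),
    "location": re.compile(r"\bshooting (?:in|at)\s+([A-Z][a-z]+)"),
}

# Invented sample documents; the third one carries a conflicting (noisy) report.
SAMPLE_DOCS = [
    "Police said the gunman, John Doe, opened fire. "
    "12 people were killed in the shooting in Springfield.",
    "The shooter John Doe was found dead after the shooting in Springfield. "
    "Officials confirmed 12 people killed.",
    "A blog post claimed 15 people were killed in the shooting at Shelbyville.",
]


def extract_candidates(docs):
    """Count every candidate value each slot pattern finds across all documents."""
    candidates = {slot: Counter() for slot in SLOT_PATTERNS}
    for doc in docs:
        for slot, pattern in SLOT_PATTERNS.items():
            candidates[slot].update(pattern.findall(doc))
    return candidates


def choose_top(candidates):
    """Keep the most frequent candidate for each slot that has any candidates."""
    return {slot: counts.most_common(1)[0][0]
            for slot, counts in candidates.items() if counts}


def summarize(slots):
    """Fill a fixed sentence template; a stand-in for the grammar-based generator."""
    return "{shooter} killed {killed} people in a shooting in {location}.".format(**slots)


if __name__ == "__main__":
    print(summarize(choose_top(extract_candidates(SAMPLE_DOCS))))
```

Choosing by frequency lets the value repeated across many pages outvote an isolated conflicting report, which is why the noisy third document does not affect the result here.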

Description
Filename and description of all included files:
ShootingsReportPdf.pdf - PDF version of our project report.
ShootingsReportDoc.docx - MS Word version of our project report.
ShootingsPresentationPpt.pptx - MS PowerPoint version of our project presentation.
ShootingsPresentationPdf.pdf - PDF version of our project presentation.
Source code is included in the folder "shooting_summary_code":
shooting_summary_code/mapper.py - The map script used on the Hadoop cluster. This file contains the regular expressions that are used to extract the data from the collections.
shooting_summary_code/reducer.py - Standard frequency-based reducer used on the cluster. Takes sorted input from the mapper and reduces based on frequency.
shooting_summary_code/parse.py - Python script that parses the output from the mapper and reducer. Does not run on the Hadoop cluster. Relies on the data being sorted by frequency (most frequent at the top of the file). Contains the implementation of the regular grammar, along with filtering techniques.
shooting_summary_code/TrigramTagger.pkl - A pickled Python version of our trigram tagger, which is used in parse.py.
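For readers unfamiliar with the frequency-based reduce step described above, the following Python sketch shows the general shape of such a reducer. It is an assumption-laden illustration, not the contents of the included scripts: the "slot:value" key format, the function names, and the sample pairs are hypothetical. It relies only on the usual Hadoop Streaming convention that the reducer receives tab-separated key/value lines already sorted by key.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum the counts of consecutive lines that share a key.

    Each line is assumed to look like 'key<TAB>count' and the input is
    assumed to be sorted by key, as it would be after the shuffle phase.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        yield key, sum(int(count) for _, count in group)


def rank_by_frequency(totals):
    """Order (key, total) pairs with the most frequent first, ties broken by key."""
    return sorted(totals, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical mapper output; sorted() stands in for Hadoop's shuffle and sort.
    mapper_output = [
        "killed:12\t1",
        "killed:15\t1",
        "killed:12\t1",
        "shooter:John Doe\t1",
    ]
    for key, total in rank_by_frequency(reduce_sorted(sorted(mapper_output))):
        print(f"{key}\t{total}")
```

Because the input is sorted, the reducer only has to compare each key with the one before it, so it never holds more than one group in memory. The final ranking step yields the most-frequent-first ordering that the downstream parsing script is described as relying on.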
Keywords
document summarization, computational linguistics, natural language generation, natural language processing, shooting summary, news summarization, NLP, Research Subject Categories::TECHNOLOGY
Citation