Generating an Intelligent Human-Readable Summary of a Shooting Event from a Large Collection of Webpages
We describe our approach to generating summaries of a shooting event from a large collection of webpages. We work with two separate events: a shooting at a school in Newtown, Connecticut and another at a mall in Tucson, Arizona. Our corpora of webpages are inherently noisy and contain a large amount of irrelevant information. In our approach, we attempt to clean up our webpage collection by removing all irrelevant content. For this, we use natural language processing techniques such as word frequency analysis, part-of-speech tagging and named entity recognition to identify key words about our news events. Using these key words as features, we employ classification techniques to categorize each document as relevant or irrelevant, and we discard the documents classified as irrelevant. We observe that generating a summary requires specific information that answers important questions such as "Who was the killer?", "Where did the shooting happen?", "How many casualties were there?" and so on. To extract these essential details from news articles, we design a template of the event summary with slots for the information we would like to extract. We design regular expressions to identify a number of 'candidate' values for the template slots. Using a combination of word frequency analysis and specific validation techniques, we choose the top candidate for each slot of our template. We use a grammar based on our template to generate a human-readable summary of each event. We use the Hadoop MapReduce framework to parallelize our workflow, along with the NLTK language processing library to simplify and speed up our development. We learned that a variety of methods and techniques is necessary to provide an accurate summary for any collection. Cleaning the collection proved to be a very difficult yet necessary task when attempting to interpret the data semantically.
We found that our attempts to extract relevant topics and sentences using Latent Dirichlet Allocation (a topic extraction method) and k-means clustering did not yield topics and sentences that were representative of our corpus. We demonstrate an effective way of summarizing a shooting event that extracts relevant information using regular expressions and generates a comprehensive human-readable summary using a regular grammar. Our solution generates a summary that includes the key information needed to understand a shooting event: the shooter(s), the date of the shooting, the location of the shooting, the number of people injured and wounded, and the weapon used. This solution is shown to work effectively for two different types of shootings: a mass murder and an assassination attempt.
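The slot-filling idea described above (regular expressions propose candidate values, frequency picks the top candidate, and a template turns the filled slots into sentences) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the patterns, the slot names, the sentence templates, and the helper names `extract_candidates`, `fill_template` and `render_summary` are hypothetical and are not the regular expressions or grammar used in the project. The validation step mentioned above is omitted, and the sketch runs on a single machine rather than on Hadoop MapReduce.

```python
import re
from collections import Counter

# Hypothetical slot patterns, chosen only to illustrate the approach.
SLOT_PATTERNS = {
    "shooter": re.compile(
        r"(?:gunman|shooter|suspect),? (?:identified as )?([A-Z][a-z]+ [A-Z][a-z]+)"
    ),
    "location": re.compile(r"shooting (?:in|at) ([A-Z][a-z]+(?:, [A-Z][a-z]+)?)"),
    "killed": re.compile(r"(\d+) (?:people )?(?:were )?killed"),
}

# Hypothetical sentence templates, one per slot, in output order.
SENTENCES = {
    "shooter": "The shooter was {}.",
    "location": "The shooting took place in {}.",
    "killed": "{} people were killed.",
}


def extract_candidates(documents):
    """Count every candidate value each slot pattern finds across all documents."""
    counts = {slot: Counter() for slot in SLOT_PATTERNS}
    for text in documents:
        for slot, pattern in SLOT_PATTERNS.items():
            counts[slot].update(pattern.findall(text))
    return counts


def fill_template(counts):
    """Pick the most frequent candidate per slot, or None if nothing matched."""
    return {
        slot: (counter.most_common(1)[0][0] if counter else None)
        for slot, counter in counts.items()
    }


def render_summary(slots):
    """Join the sentences for every filled slot; empty slots are skipped."""
    return " ".join(
        template.format(slots[slot])
        for slot, template in SENTENCES.items()
        if slots.get(slot) is not None
    )
```

Called on a list of cleaned article texts, `render_summary(fill_template(extract_candidates(texts)))` returns one sentence per slot that received at least one candidate. Voting by frequency lets a value repeated across many articles outweigh a stray or mistaken report in a single one, which is why it suits a noisy collection.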