
dc.contributor.author: Chandrasekaran, Arjun
dc.contributor.author: Sharma, Saurav
dc.contributor.author: Sulucz, Peter
dc.contributor.author: Tran, Jonathan
dc.description: Filename and description of all included files:
ShootingsReportPdf.pdf - PDF version of our project report.
ShootingsReportDoc.docx - MS Word version of our project report.
ShootingsPresentationPpt.pptx - MS PowerPoint version of our project presentation.
ShootingsPresentationPdf.pdf - PDF version of our project presentation.
Source code is included in folder "shooting_summary_code":
shooting_summary_code/ - The map script used on the Hadoop cluster. This file contains the regular expressions which are used to extract the data from the collections.
shooting_summary_code/ - Standard frequency based reducer on the cluster. Takes sorted input from the mapper, and reduces based on frequency.
shooting_summary_code/ - Python which parses the output from the mapper and reducer. Does not run on Hadoop cluster. Relies on data being sorted by frequency (most frequent at top of file). Contains the implementation of the regular grammar, along with filtering techniques.
shooting_summary_code/TrigramTagger.pkl - A Python pickled version of our trigram tagger, which is used in
dc.description.abstract: We describe our approach to generating summaries of a shooting event from a large collection of webpages. We work with two separate events: a shooting at a school in Newtown, Connecticut, and another at a mall in Tucson, Arizona. Our corpora of webpages are inherently noisy and contain a large amount of irrelevant information. In our approach, we attempt to clean up our webpage collection by removing all irrelevant content. For this, we use natural language processing techniques such as word frequency analysis, part-of-speech tagging, and named entity recognition to identify key words about our news events. Using these key words as features, we employ classification techniques to categorize each document as relevant or irrelevant, and we discard the documents classified as irrelevant. We observe that to generate a summary, we require specific information that enables us to answer important questions such as "Who was the killer?", "Where did the shooting happen?", "How many casualties were there?" and so on. To enable extraction of these essential details from news articles, we design a template of the event summary with slots for the information we would like to extract. We design regular expressions to identify a number of 'candidate' values for the template slots. Using a combination of word frequency analysis and specific validation techniques, we choose the top candidate for each slot of our template. We use a grammar based on our template to generate a human-readable summary of each event. We use the Hadoop MapReduce framework to parallelize our workflow, along with the NLTK language processing library to simplify and speed up our development. We learned that a variety of methods and techniques is necessary to provide an accurate summary for any collection, and that cleaning is a very difficult yet necessary task when attempting to interpret data semantically. We found that our attempts to extract relevant topics and sentences using the topic extraction method Latent Dirichlet Allocation and k-means clustering did not result in topics and sentences that were indicative of our corpus. We demonstrate an effective way of summarizing a shooting event that extracts relevant information by using regular expressions and generates a comprehensive human-readable summary using a regular grammar. Our solution generates a summary that includes key information needed to understand a shooting event, such as the shooter(s), date of the shooting, location of the shooting, number of people injured and wounded, and the weapon used. This solution is shown to work effectively for two different types of shootings: a mass murder and an assassination attempt. [en_US]
dc.description.sponsorship: NSF DUE-1141209 and IIS-1319578 [en_US]
dc.rights: CC0 1.0 Universal [*]
dc.subject: document summarization [en_US]
dc.subject: computational linguistics [en_US]
dc.subject: natural language generation [en_US]
dc.subject: natural language processing [en_US]
dc.subject: shooting summary [en_US]
dc.subject: news summarization [en_US]
dc.subject: Research Subject Categories::TECHNOLOGY [en_US]
dc.title: Generating an Intelligent Human-Readable Summary of a Shooting Event from a Large Collection of Webpages [en_US]
dc.title.alternative: Generating a Summary for a Shooting Event [en_US]
dc.type: Technical Report [en_US]
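
The slot-filling approach outlined in the abstract (regular expressions propose candidate values, frequency picks the top candidate, a template produces the sentence) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the patterns `KILLED_RE` and `CITY_RE`, the helper names, and the one-sentence template are hypothetical and are not taken from the report's code, and the sketch omits the Hadoop mapper and reducer, the validation techniques, and the NLTK tagging that the report describes.

```python
import re
from collections import Counter

# Hypothetical slot patterns, invented for this sketch; the report's actual
# regular expressions are not reproduced in this record.
KILLED_RE = re.compile(
    r"\b(\d+)\s+(?:people|persons|victims)\s+(?:were\s+)?killed\b",
    re.IGNORECASE,
)
CITY_RE = re.compile(r"\bshooting\s+in\s+([A-Z][a-z]+)")


def candidates(pattern, documents):
    """Yield the first captured group of every match across all documents."""
    for text in documents:
        for match in pattern.finditer(text):
            yield match.group(1)


def top_candidate(pattern, documents):
    """Return the most frequent candidate for a slot, or None if none matched."""
    counts = Counter(candidates(pattern, documents))
    return counts.most_common(1)[0][0] if counts else None


def summarize(documents):
    """Fill a one-sentence template with the top candidate for each slot."""
    slots = {
        "location": top_candidate(CITY_RE, documents),
        "killed": top_candidate(KILLED_RE, documents),
    }
    if None in slots.values():
        return None  # at least one slot had no candidate at all
    return (
        "A shooting took place in {location}, "
        "where {killed} people were killed.".format(**slots)
    )
```

Called on a few invented sentences about a fictional town, `summarize` keeps the value that most documents agree on for each slot, which is the sense in which frequency acts as a crude validation step here.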

License: CC0 1.0 Universal