Summarization of Maryland Shooting Collection

dc.contributor.author: Khawas, Prapti [en]
dc.contributor.author: Banerjee, Bipasha [en]
dc.contributor.author: Zhao, Shuqi [en]
dc.contributor.author: Fan, Yiyang [en]
dc.contributor.author: Kim, Yoonjin [en]
dc.date.accessioned: 2018-12-15T01:18:20Z [en]
dc.date.available: 2018-12-15T01:18:20Z [en]
dc.date.issued: 2018-12-12 [en]
dc.description.abstract: The goal of this work is to generate summaries of two Maryland shooting events from a large collection of web pages related to a shooting at Great Mills High School and another at the Capital Gazette newsroom. Since our team had no prior experience with Computational Linguistics / Natural Language Processing (NLP), we built summaries using 10 different methods, as suggested by course instructor Dr. Edward Fox, each more sophisticated than the previous ones, so that we could learn key NLP concepts along the way. First, we found a set of the most frequent important words. Then, we found other words in the articles that are synonyms of those frequent words, along with sets of hypernyms and hyponyms. We identified a set of words constrained by part of speech (POS), e.g., nouns and verbs. We then tried various classification techniques in Apache Mahout to classify the documents into the two events and eliminate irrelevant documents. Next, we identified a set of frequent and important named entities using the NLTK and SpaCy Named Entity Recognition (NER) modules. We identified a set of important topics using Latent Dirichlet Allocation (LDA). We then generated clusters of documents using K-means. Next, we extracted a set of values for each slot matching collection semantics using regular expressions, and generated a readable summary explaining the slots and values using a Context Free Grammar we developed. Finally, we used the Pointer Generator deep learning approach to generate a readable abstractive summary. Using the above approach, we generated two extractive summaries, for the newsroom shooting event and the school shooting event, with ROUGE-1 scores of around 0.33 and 0.26, respectively. For the abstractive summaries we generated, the ROUGE-1 score was 0.36 for the newsroom shooting event and 0.20 for the school shooting event.
We also evaluated the summaries at the sentence level and found that the abstractive school shooting summary had a higher ROUGE-1 score (0.88) than the abstractive newsroom shooting summary (0.73). We employed the Hadoop MapReduce framework to speed up processing of our large collection. We used various other tools, such as the NLTK language processing library and Apache Mahout, a distributed linear algebra framework, to simplify our development. We learned that producing an accurate summary requires a variety of methods and techniques suited to the collection. We also learned the importance of cleaning the collection and the challenges involved in the task. [en]
dc.description.notes: Details of the files included: MarylandShooting-Presentation.pdf - Final class presentation in PDF; MarylandShooting-Presentation.pptx - Final class presentation in PowerPoint format; MarylandShooting-Report.pdf - Final report in PDF; MarylandShooting-ReportSource.zip - Final report archive from LaTeX project in Overleaf; MarylandShooting-SourceCode.zip - Code developed for summarization as described in the report. [en]
dc.description.sponsorship: NSF: IIS-1619028 [en]
dc.identifier.uri: http://hdl.handle.net/10919/86407 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Creative Commons CC0 1.0 Universal Public Domain Dedication [en]
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/ [en]
dc.subject: maryland shooting [en]
dc.subject: big data summarization [en]
dc.subject: text summarization [en]
dc.subject: natural language processing [en]
dc.subject: webpage collection [en]
dc.subject: hadoop [en]
dc.subject: mahout [en]
dc.title: Summarization of Maryland Shooting Collection [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.type: Software [en]
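
The abstract reports ROUGE-1 scores for the generated summaries. As a rough illustration of what that metric measures, here is a minimal sketch of unigram-overlap ROUGE-1 in Python. It is not the evaluation code from the report: the function name `rouge_1`, the lowercasing, and the whitespace tokenization are assumptions made for this example, and the record does not say which ROUGE variant (recall, precision, or F1) or which tool the authors used. The two sentences are invented for illustration and are not taken from the collection.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap ROUGE-1 between a candidate and a reference summary.

    Assumption: tokens are lowercased and split on whitespace.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a unigram counts only as often as it occurs in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Illustrative sentences only; not taken from the collection.
scores = rouge_1(
    "a gunman opened fire in the newsroom",
    "the gunman opened fire inside the newsroom",
)
print(scores)
```

A score near 1 means the generated summary shares almost all of its unigrams with the reference; scores in the 0.2 to 0.4 range, like those quoted in the abstract, indicate partial word overlap.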

Files

Original bundle
Name: MarylandShooting-Presentation.pdf
Size: 115.34 KB
Format: Adobe Portable Document Format

Name: MarylandShooting-Presentation.pptx
Size: 702.59 KB
Format: Microsoft Powerpoint XML

Name: MarylandShooting-SourceCode.zip
Size: 13.9 KB

Name: MarylandShooting-Report.pdf
Size: 282.99 KB
Format: Adobe Portable Document Format

Name: MarylandShooting-ReportSource.zip
Size: 93.42 KB

License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission