Show simple item record

dc.contributor.author: Khawas, Prapti
dc.contributor.author: Banerjee, Bipasha
dc.contributor.author: Zhao, Shuqi
dc.contributor.author: Fan, Yiyang
dc.contributor.author: Kim, Yoonjin
dc.description.abstract: The goal of this work is to generate summaries of two Maryland shooting events from a large collection of web pages related to a shooting at Great Mills High School and another at the Capital Gazette newsroom. Since our team had no prior experience with Computational Linguistics / Natural Language Processing (NLP), we built summaries using 10 different methods, as suggested by course instructor Dr. Edward Fox, each more sophisticated than the previous one, to learn key concepts in NLP. First, we found a set of the most frequent important words. Then, we found other words in the articles that mean the same as those frequent words. Along with the synonyms, we found sets of hypernyms and hyponyms. We identified a set of words constrained by part of speech (POS), e.g., nouns and verbs. We then tried various classification techniques in Apache Mahout to classify the documents into the two events and eliminate irrelevant documents. Next, we identified a set of frequent and important named entities using the NLTK and spaCy Named Entity Recognition (NER) modules. We identified a set of important topics using Latent Dirichlet Allocation (LDA). We then generated clusters of documents using K-means. Next, we extracted a set of values for each slot matching collection semantics using regular expressions, and generated a readable summary explaining the slots and values using a Context Free Grammar we developed. Finally, we used the Pointer Generator deep learning approach to generate a readable abstractive summary. Using the above approach, we generated two extractive summaries, for the newsroom shooting event and the school shooting event, with ROUGE-1 scores of around 0.33 and 0.26, respectively. For the abstractive summaries we generated, the ROUGE-1 score was 0.36 for the newsroom shooting event and 0.20 for the school shooting event.
We also evaluated the summaries at the sentence level and found that the abstractive school shooting summary had a higher ROUGE-1 score (0.88) than the abstractive newsroom shooting summary (0.73). We employed the Hadoop MapReduce framework to speed up processing of our large collection. We also used other tools, such as the NLTK language processing library and Apache Mahout, a distributed linear algebra framework, to simplify our development. We learned that a variety of methods and techniques suited to the collection is necessary to produce an accurate summary. We also learned the importance of cleaning the collection and the challenges involved in the task. [en_US]
dc.description.sponsorship: NSF: IIS-1619028 [en_US]
dc.publisher: Virginia Tech [en_US]
dc.rights: CC0 1.0 Universal [*]
dc.subject: maryland shooting [en_US]
dc.subject: big data summarization [en_US]
dc.subject: text summarization [en_US]
dc.subject: natural language processing [en_US]
dc.subject: webpage collection [en_US]
dc.title: Summarization of Maryland Shooting Collection [en_US]
dc.description.notes: Details of the files included: MarylandShooting-Presentation.pdf - Final class presentation in PDF; MarylandShooting-Presentation.pptx - Final class presentation in PowerPoint format; MarylandShooting-Report.pdf - Final report in PDF; - Final report archive from LaTeX project in Overleaf; - Code developed for summarization as described in the report. [en_US]
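
The abstract reports ROUGE-1 scores without saying which variant (precision, recall, or F1) or which tool produced them. As a rough illustration of what such a score measures, here is a minimal sketch of ROUGE-1 as clipped unigram overlap between a candidate summary and a reference summary. The tokenizer, the function names, and the example sentences are assumptions made for illustration; they are not taken from the project's code or data.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_1(candidate, reference):
    """Return ROUGE-1 precision, recall, and F1 based on clipped unigram overlap."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))

    # Counter intersection keeps the minimum count per token, so a word
    # repeated in the candidate is credited at most as often as it
    # appears in the reference.
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical sentences, not drawn from the collection.
    candidate = "The gunman entered the newsroom"
    reference = "The gunman entered the Capital Gazette newsroom"
    print(rouge_1(candidate, reference))
```

In this sketch every candidate word also appears in the reference, so precision is 1.0, while recall is lower because the reference contains words the candidate omits. Scores in the 0.2 to 0.4 range, like those reported above, indicate that a minority of unigrams are shared between the generated and reference summaries.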


License: CC0 1.0 Universal