Summarization of Maryland Shooting Collection

Abstract

The goal of this work is to generate summaries of two Maryland shooting events from a large collection of web pages: one set related to a shooting at Great Mills High School and another to a shooting at the Capital Gazette newsroom. Because our team had no prior experience with Computational Linguistics / Natural Language Processing (NLP), we built summaries using 10 different methods, as suggested by course instructor Dr. Edward Fox. Each method was more sophisticated than the previous ones, which let us learn key NLP concepts step by step.

First, we found a set of the most frequent important words. Then, we found other words in the articles that mean the same as those frequent words. Along with these synonyms, we found sets of hypernyms and hyponyms. We identified a set of words constrained by part of speech (POS), e.g., nouns and verbs. We then tried out various classification techniques in Apache Mahout to classify the documents into the two events and eliminate irrelevant documents. Next, we identified a set of frequent and important named entities using the NLTK and spaCy Named Entity Recognition (NER) modules. We identified a set of important topics using Latent Dirichlet Allocation (LDA). We then generated clusters of documents using K-means. Next, we used regular expressions to extract a set of values for each slot matching the collection semantics, and generated a readable summary explaining the slots and values using a Context Free Grammar we developed. Finally, we used the Pointer Generator deep learning approach to generate a readable abstractive summary.
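The first step above, finding frequent important words, can be sketched roughly as follows. This is a minimal illustration and not the team's actual code: it uses only the Python standard library, and the stop-word list, the function name `frequent_words`, and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would likely
# use a fuller list such as the one shipped with NLTK.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "at", "on",
    "for", "was", "were", "is", "are", "by", "with", "that", "this",
}

def frequent_words(documents, top_n=10):
    """Return the top_n most frequent non-stop-words across all documents."""
    counts = Counter()
    for doc in documents:
        # Lowercase and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", doc.lower())
        # Drop stop words and very short tokens, which rarely carry meaning.
        counts.update(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return counts.most_common(top_n)

# Made-up sample sentences, used only to exercise the function.
sample_docs = [
    "The shooting at the school shocked the county.",
    "Police responded to the shooting.",
]
print(frequent_words(sample_docs, top_n=3))
```

On a large collection, the same counting logic maps naturally onto a MapReduce job (emit a count per token, then sum per token), though the exact job layout used for this collection is not described here.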

Using the above approach, we generated two extractive summaries, one for the newsroom shooting event and one for the school shooting event, with ROUGE-1 scores of around 0.33 and 0.26, respectively. For the abstractive summaries, the ROUGE-1 score was 0.36 for the newsroom shooting event and 0.20 for the school shooting event. We also evaluated the summaries at the sentence level and found that the abstractive school shooting summary had a higher ROUGE-1 score (0.88) than the abstractive newsroom shooting summary (0.73).
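For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference summary. The sketch below shows one common way to compute it, with clipped unigram counts and precision, recall, and F1. The abstract does not say which ROUGE-1 variant or toolkit produced the reported scores, so the tokenizer, the function name `rouge_1`, and the example sentences are assumptions for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge_1(candidate, reference):
    """Compute ROUGE-1 precision, recall, and F1 from clipped unigram overlap."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand_counts & ref_counts).values())
    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up sentences, used only to exercise the function.
scores = rouge_1(
    "the gunman entered the newsroom",
    "a gunman entered the newsroom on thursday",
)
print(scores)
```

In this made-up example, four of the five candidate unigrams match the reference after clipping, so precision is 4/5 and recall is 4/7.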

We employed the Hadoop MapReduce framework to reduce the processing time for our large collection. To simplify development, we also used other tools such as the NLTK language processing library and Apache Mahout, a distributed linear algebra framework. We learned that producing an accurate summary requires a variety of methods and techniques suited to the collection. We also learned the importance of cleaning the collection and the challenges involved in that task.

Keywords

maryland shooting, big data summarization, text summarization, natural language processing, webpage collection, hadoop, mahout

Citation