Big Data Text Summarization - Attack Westminster
Abstract
Automatic text summarization is the process of using software to distill the most important information from a text document into an abridged summary. The task can be regarded as a function that takes one or more documents as input and produces a summary as output. There are two main ways to create a summary: extractive and abstractive. Extractive summarization selects the most relevant sentences from the input and concatenates them to form a summary. Graph-based algorithms such as TextRank, feature-based models such as TextTeaser, topic-based models such as Latent Semantic Analysis (LSA), and grammar-based models are all approaches to extractive summarization. Abstractive summarization aims to create a summary similar to one a human would write. It keeps the original intent but uses new phrases and words not found in the original text. One of the most commonly used abstractive models is the encoder-decoder model, a neural network architecture mainly used in machine translation. More recently, hybrid approaches that combine extractive and abstractive summarization have appeared, such as the Pointer-Generator Network and the Extract-then-Abstract model.
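To make the extractive, graph-based idea concrete, the following is a minimal sketch of a TextRank-style sentence ranker in plain Python. It is not the implementation used in this project: the naive sentence splitter, the word-overlap similarity, and the function names (`split_sentences`, `textrank`, `summarize`) are assumptions chosen for illustration.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on ., ! and ? (an assumption; real systems use a tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Lowercased set of alphanumeric words in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Word overlap between two sentences, normalized by their log lengths."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) < 2 or len(tb) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(ta & tb) / (math.log(len(ta)) + math.log(len(tb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over the similarity graph."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional
            # to the weight of the edge between j and i.
            rank = sum(weights[j][i] / out_sums[j] * scores[j]
                       for j in range(n) if out_sums[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))
```

Sentences that share vocabulary with many other sentences accumulate score, while a sentence with no overlapping words keeps only the baseline score and is unlikely to be selected.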
In this course, we were given both a small dataset (about 500 documents) and a big dataset (about 11,300 documents), each consisting mainly of web archives about a specific event. Our group focused on reports about a terrorist event, Attack Westminster, which occurred outside the Palace of Westminster in London on March 22, 2017. The attacker, 52-year-old Briton Khalid Masood, drove a car into pedestrians on the pavement, injuring more than 50 people, 5 of them fatally. The attack was treated as "Islamist-related terrorism".
We first created a Solr index for both the small dataset and the big dataset, which let us run various queries to learn more about the data. The index also helped another team create a gold standard summary of our dataset for us. We then worked through a range of concepts and topics in text summarization and natural language processing. Specifically, we used the NLTK library and the spaCy package to create a set of the most frequent important words, WordNet synsets that cover those words, words constrained by part of speech (POS), and frequent and important named entities. We also applied the LSA model to retrieve the most important topics. By clustering the dataset with k-means and selecting important sentences from the clusters using an implementation of the TextRank algorithm, we were able to generate a multi-paragraph summary. With the help of named entity recognition and pattern-based matching, we confidently extracted information such as the name of the attacker, the date, the location, nearby landmarks, the number killed, the number injured, and the type of attack. We then drafted a template for a readable summary and filled its slots with these values. Each of these results individually formed a summary that captures the most important information about the Westminster Attack.
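The slot-value approach described above can be sketched as follows. This is an illustration only: it covers just the pattern-based matching step using the standard `re` module (no named entity recognition), and the patterns, the template wording, and the names `SLOT_PATTERNS`, `TEMPLATE`, `fill_slots`, and `render` are hypothetical rather than taken from the project. Each slot value is chosen by majority vote across documents, which is one plausible way to cope with noisy web pages.

```python
import re
from collections import Counter

# Hypothetical patterns for this sketch; the project's real patterns are not shown.
SLOT_PATTERNS = {
    "date": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"),
    "attacker": re.compile(
        r"\battacker,? (?:was )?(?:identified as )?([A-Z][a-z]+ [A-Z][a-z]+)"),
    "injured": re.compile(
        r"\binjur(?:ed|ing) (?:more than |at least )?(\d+)"),
}

# Hypothetical template with one placeholder per slot.
TEMPLATE = ("On {date}, {attacker} carried out the attack; "
            "reported number injured: {injured}.")


def fill_slots(documents):
    """Collect candidate values per slot and keep the most frequent one."""
    votes = {slot: Counter() for slot in SLOT_PATTERNS}
    for doc in documents:
        for slot, pattern in SLOT_PATTERNS.items():
            for match in pattern.finditer(doc):
                # Use the capturing group if the pattern has one,
                # otherwise the whole match.
                votes[slot][match.group(match.lastindex or 0)] += 1
    return {slot: (counts.most_common(1)[0][0] if counts else "unknown")
            for slot, counts in votes.items()}


def render(slots):
    """Fill the template with the chosen slot values."""
    return TEMPLATE.format(**slots)
```

For example, calling `render(fill_slots(docs))` on a list of article texts produces one sentence in which every slot holds the value that the most documents agree on, or `unknown` when no pattern matched.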
The most successful results were obtained using the extractive summarization method (k-means clustering and TextRank), the slot-value method (named entity recognition and pattern-based matching), and the abstractive summarization method (deep learning). We evaluated each resulting summary against the gold standard summary created by Team 3, using a combination of ROUGE metrics and named-entity coverage. Overall, the best summary was obtained using the extractive summarization method, which outperformed the other methods on both ROUGE metrics and named-entity coverage.
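As a rough illustration of the two evaluation measures mentioned above, the sketch below computes ROUGE-N precision, recall, and F1 from clipped n-gram counts, plus a simple named-entity coverage ratio. It is a simplified stand-in, not the evaluation code used in this project: the tokenization is a bare regular expression, the gold entities are assumed to be supplied as a list of strings, and the function names are hypothetical.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Multiset of word n-grams from lowercased alphanumeric tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N scores of a candidate summary against one reference summary."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection clips each n-gram count to the smaller of the two.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def entity_coverage(candidate, gold_entities):
    """Fraction of gold-standard entities mentioned in the candidate summary."""
    if not gold_entities:
        return 0.0
    text = candidate.lower()
    found = sum(1 for entity in gold_entities if entity.lower() in text)
    return found / len(gold_entities)
```

ROUGE recall rewards summaries that reproduce the wording of the gold standard, while entity coverage checks whether key names, places, and dates survive in the summary regardless of phrasing, so the two measures complement each other.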