CS4984: Special Topics
The title of the CS4984 Special Topics course varies from year to year; examples include Computational Linguistics (2014) and Big Data Text Summarization (2018). The course also includes a graduate section, CS5984.
Browsing CS4984: Special Topics by Subject "abstractive summarization"
- Automatic Summarization of News Articles about Hurricane Florence
Wanye, Frank; Ganguli, Samit; Tuckman, Matt; Zhang, Joy; Zhang, Fangzheng (Virginia Tech, 2018-12-07)
We present our approach for generating automatic summaries from a collection of news articles acquired from the World Wide Web relating to Hurricane Florence. Our approach consists of 10 distinct steps, at the end of which we produce three separate summaries using three distinct methods:
1. A template summary, in which we extract information from the web page collection to fill in blanks in a template.
2. An extractive summary, in which we extract the most important sentences from the web pages in the collection.
3. An abstractive summary, in which we use deep learning techniques to rephrase the contents of the web pages in the collection.
The first six steps of our approach involve extracting important words, synsets, words constrained by part of speech, a set of discriminating features, important named entities, and important topics from the collection. This information is then used by the algorithms that generate the automatic summaries.
To produce the template summary, we employed a modified version of the hurricane summary template provided to us by the instructor. For each blank in the modified template, we used regular expression matching with selected keywords to filter out relevant sentences from the collection, and then a combination of regex matching and entity tagging to select the information for filling in the blank. Most values also required unit conversion so that we captured all values reported in the articles, not just those given in a specific unit. We then performed numerical analysis on these values to take either the mode or the mean of the set, and for some values, such as rainfall, we used the standard deviation to estimate the maximum.
To produce the extractive summary, we employed existing extractive summarization libraries. To synthesize information from multiple articles, we used an iterative approach: we concatenated the generated summaries and then summarized the concatenation.
To produce the abstractive summary, we employed existing deep learning summarization techniques, in particular a pre-trained Pointer-Generator neural network model. As with the extractive summary, we clustered the web pages in the collection by topic before running them through the neural network model, to reduce the amount of repeated information produced.
Of the three summaries we generated, the template summary is the best overall due to its coherence. The abstractive and extractive summaries both provide a fair amount of information but are severely lacking in organization and readability, and they include specific details that are irrelevant to the hurricane. All three summaries could be improved with further data cleaning, and the template summary could easily be extended to include more information about the event, making it more complete.
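As an illustration of the regex-based slot filling and numerical aggregation this abstract describes, here is a minimal Python sketch. The wind-speed pattern, the unit table, and the template sentence are invented for illustration; the team's actual template, keywords, and units come from their report.

```python
import re
from statistics import mean

# Illustrative unit table (assumed, not from the report): normalize to mph.
MPH_PER_UNIT = {"mph": 1.0, "km/h": 0.621371, "knots": 1.15078}

# Hypothetical pattern: a number followed by a wind-speed unit, near "wind".
WIND_PATTERN = re.compile(
    r"winds?\D{0,20}?(\d+(?:\.\d+)?)\s*(mph|km/h|knots)", re.IGNORECASE
)

def extract_wind_speeds(sentences):
    """Collect wind-speed mentions, converting every value to mph."""
    speeds = []
    for sentence in sentences:
        for value, unit in WIND_PATTERN.findall(sentence):
            speeds.append(float(value) * MPH_PER_UNIT[unit.lower()])
    return speeds

def fill_template(sentences):
    """Fill one blank of a template sentence with the mean observed value."""
    speeds = extract_wind_speeds(sentences)
    wind = round(mean(speeds)) if speeds else "UNKNOWN"
    return f"Hurricane Florence brought sustained winds of roughly {wind} mph."

print(fill_template([
    "Forecasters warned of winds near 90 mph along the coast.",
    "The storm packed winds of 150 km/h before landfall.",
]))
```

Converting every mention to a common unit before aggregating is what lets the mean (or mode, or a standard-deviation-based maximum) be computed over all the articles rather than only those that happened to report the value in one unit.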
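The iterative concatenate-and-resummarize loop used for the extractive summary can likewise be sketched, with a toy frequency-based sentence scorer standing in for the extractive libraries the team actually used (which the abstract does not name):

```python
import re
from collections import Counter

def score_sentences(text, top_n=3):
    """Tiny extractive summarizer: rank sentences by summed word frequency."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    ranked = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )
    # Keep the top sentences, emitted in their original order for readability.
    kept = set(ranked[:top_n])
    return " ".join(s for s in sentences if s in kept)

def iterative_summary(documents, top_n=3):
    """Summarize each document, concatenate, then summarize the concatenation."""
    per_doc = [score_sentences(doc, top_n) for doc in documents]
    return score_sentences(" ".join(per_doc), top_n)
```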
- Big Data Text Summarization - Hurricane Harvey
Geissinger, Jack; Long, Theo; Jung, James; Parent, Jordan; Rizzo, Robert (Virginia Tech, 2018-12-12)
Natural language processing (NLP) has advanced in recent years. Accordingly, we present progressively more complex generated text summaries on the topic of Hurricane Harvey.
We first utilized TextRank, an unsupervised extractive summarization algorithm. TextRank is computationally expensive, and the sentences it selects aren't always directly related or essential to the topic at hand. When evaluating TextRank, we found that a single interjected sentence ruined the flow of the summary. We also found that the ROUGE scores for our TextRank summary were quite low compared to a gold standard that was prepared for us. However, the TextRank summary scored well on ROUGE when compared against the lead of the Wikipedia article for Hurricane Harvey.
To improve upon TextRank, we utilized template summarization with named entities. Template summarization takes less time to run than TextRank but is supervised by the author of the template and script, who chooses which named entities are valuable. It is therefore highly dependent on human intervention to produce reasonable, readable summaries that aren't error-prone. As expected, the template summary evaluated well against both the gold standard and the Wikipedia article lead, mainly because we were able to include named entities we thought were pertinent to the summary.
Beyond extractive methods like TextRank and template summarization, we pursued abstractive summarization using pointer-generator networks, and multi-document summarization with pointer-generator networks and maximal marginal relevance. The benefit of abstractive summarization is that it is more in line with how humans summarize documents. Pointer-generator networks, however, require GPUs to run properly and a large amount of training data; fortunately, we were able to use a pre-trained network to generate summaries. The pointer-generator network is the centerpiece of our abstractive methods and is what allowed us to create these summaries in the first place.
NLP is at an inflection point due to deep learning, and our summaries generated with a state-of-the-art pointer-generator neural network are filled with details about Hurricane Harvey, including the damage incurred, the average amount of rainfall, and the locations it affected the most. The summary is also free of grammatical errors. Finally, we used a novel Python library for multi-document summarization with deep learning, written by Logan Lebanoff at the University of Central Florida, to summarize our Hurricane Harvey dataset of 500 articles together with the Wikipedia article for Hurricane Harvey. The summary of the Wikipedia article is our final summary and has the highest ROUGE scores we could attain.
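TextRank, the extractive baseline in this abstract, ranks sentences by running PageRank over a graph whose edges encode sentence similarity. A minimal sketch, assuming the normalized-overlap similarity from Mihalcea and Tarau's TextRank paper (with a +1 shift to avoid division by zero on one-word sentences) and networkx for PageRank; this is a generic reconstruction, not the team's code:

```python
import math
import re
from itertools import combinations

import networkx as nx

def textrank_summary(text, top_n=3):
    """Extractive summary: PageRank over a sentence-similarity graph."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    token_sets = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]

    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i, j in combinations(range(len(sentences)), 2):
        overlap = len(token_sets[i] & token_sets[j])
        if overlap:
            # Overlap normalized by sentence lengths, per the TextRank paper
            # (lengths shifted by 1 so short sentences don't divide by zero).
            norm = math.log(len(token_sets[i]) + 1) + math.log(len(token_sets[j]) + 1)
            graph.add_edge(i, j, weight=overlap / norm)

    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    # Emit the chosen sentences in their original order for readability.
    return " ".join(sentences[i] for i in sorted(top))
```

The pairwise similarity loop is quadratic in the number of sentences, which is one concrete reason TextRank gets expensive on a 500-article collection.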
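Maximal marginal relevance, which the abstract pairs with the pointer-generator network for multi-document summarization, greedily picks the next sentence that is relevant to a query yet dissimilar from what has already been chosen. A minimal sketch over precomputed sentence vectors (the vectorization step is assumed, not shown):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors; 0.0 for zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def mmr_select(sentence_vecs, query_vec, k=5, lam=0.7):
    """Greedy MMR: lam * relevance-to-query - (1 - lam) * redundancy."""
    selected = []
    remaining = list(range(len(sentence_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(sentence_vecs[i], query_vec)
            redundancy = max(
                (cosine(sentence_vecs[i], sentence_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lam near 1 the selection reduces to pure relevance ranking; lowering lam trades relevance for diversity, which is what suppresses repeated information across many near-duplicate news articles.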
- Generating Text Summaries for the Facebook Data Breach with Prototyping on the 2017 Solar Eclipse
Hamilton, Leah; Robb, Esther; Fitzpatrick, April; Goel, Akshay; Nandigam, Ramya (Virginia Tech, 2018-12-13)
Summarization is often a time-consuming task for humans. Automated methods can summarize a larger volume of source material in a shorter amount of time, but creating a good summary with these methods remains challenging. This submission contains all work related to a semester-long project in CS 4984/5984 to generate the best possible summary of a collection of 10,829 web pages about the Facebook-Cambridge Analytica data breach, with some early prototyping done on 500 web pages about the 2017 Solar Eclipse. A final report, a final presentation, and several archives of code, input data, and results are included.
The work implements basic natural language processing techniques such as word frequency, lemmatization, and part-of-speech tagging, working up to a complete human-readable summary at the end of the course. Extractive, abstractive, and combination methods were used to generate the final summaries, all of which are included, and their results are compared. The summary subjectively evaluated as best was a purely extractive summary built by concatenating summaries of document categories. This method was coherent and thorough, but it involved manual tuning to select categories and still had some redundancy. All attempted methods are described, and the less successful summaries are also included.
This report presents a framework for how to summarize complex document collections with multiple relevant topics. The summary itself identifies the information most covered about the Facebook-Cambridge Analytica data breach and is a reasonable introduction to the topic.
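The basic NLP techniques this abstract names (word frequency, lemmatization, part-of-speech tagging) map directly onto standard NLTK calls. A minimal sketch, assuming NLTK rather than whatever exact toolkit the team used:

```python
from collections import Counter

import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads for the tokenizer, POS tagger, and WordNet data.
for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg, quiet=True)

def most_frequent_lemmas(text, top_n=10):
    """Tokenize, POS-tag, lemmatize, and count the most frequent lemmas."""
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)
    lemmatizer = WordNetLemmatizer()
    # Keep alphabetic tokens only; lemmatize verbs as verbs, the rest as nouns.
    lemmas = [
        lemmatizer.lemmatize(tok, "v" if tag.startswith("VB") else "n")
        for tok, tag in tagged
        if tok.isalpha()
    ]
    return Counter(lemmas).most_common(top_n)
```

Counting lemmas rather than raw tokens collapses inflected forms such as "breach" and "breaches" into one entry, which is why the frequency step typically follows lemmatization.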