CS4984: Special Topics
The title of the CS4984 Special Topics course changes from year to year, for example, Computational Linguistics (2014) and Big Data Text Summarization (2018). The course also includes a graduate section, CS5984.
Browsing CS4984: Special Topics by Issue Date
- Computational Linguistics Hurricane Group
  Crowder, Nicholas; Nguyen, David; Hsu, Andy; Mecklenburg, Will; Morris, Jeff (2014-12)
  The problem/project-based learning described in our presentation and report addresses automatic summarization of web content using natural language processing. Initially, we used simple techniques such as word frequencies, WordNet, and n-grams to create summaries. Later approaches became more complex with the introduction of tools such as Mahout and k-means for topics and clustering. This work finally culminated in the use of custom templates and a grammar to generate English sentences that accurately summarize a corpus. Our English summary was created using a grammar alongside regular expressions to extract information. The previous units all built up to the construction of quality regular expressions, a clean dataset, and some extra tools, such as a classifier trained on our data and a part-of-speech tagger.
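As an illustration of the early units described above, the following is a minimal sketch (not the team's actual code) of computing word frequencies and n-grams with NLTK; the corpus path and stopword handling are assumptions.

```python
# Minimal sketch (not the original course code): word frequencies and bigrams with NLTK.
# Assumes a plain-text corpus file at "corpus.txt"; adjust the path for your data.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from collections import Counter

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

with open("corpus.txt", encoding="utf-8") as f:
    text = f.read().lower()

stop = set(stopwords.words("english"))
tokens = [t for t in word_tokenize(text) if t.isalpha() and t not in stop]

# Most frequent single words and bigrams, the starting point for a crude summary.
print(Counter(tokens).most_common(20))
print(Counter(ngrams(tokens, 2)).most_common(20))
```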
- OutbreakSum: Automatic Summarization of Texts Relating to Disease Outbreaks
  Gruss, Richard; Morgado, Daniel; Craun, Nate; Shea-Blymyer, Colin (2014-12)
  The goal of the fall 2014 Disease Outbreak Project (OutbreakSum) was to develop software for automatically analyzing and summarizing large collections of texts pertaining to disease outbreaks. Although our code was tested on collections about specific diseases (a small one about Encephalitis and a large one about Ebola), most of our tools would work on texts about any infectious disease, where the key information relates to locations, dates, number of cases, symptoms, prognosis, and government and healthcare organization interventions. In the course of the project, we developed a code base that performs several key Natural Language Processing (NLP) functions. Tools that could potentially be useful for other Natural Language Generation (NLG) projects include:
  1. A framework for developing MapReduce programs in Python that allows for local running and debugging;
  2. Tools for document collection cleanup, such as small-file removal, duplicate-file removal (based on content hashes), sentence and paragraph tokenization, non-relevant file removal, and encoding translation;
  3. Utilities to simplify and speed up Named Entity Recognition with Stanford NER by using the Java API directly;
  4. Utilities to leverage the full extent of the Stanford CoreNLP library, including tools for parsing and coreference resolution;
  5. Utilities to simplify using the OpenNLP Java library for text processing. By configuring and running a single Java class, you can use OpenNLP to perform part-of-speech tagging and named entity recognition on your entire collection in minutes.
  We have classified the tools available in OutbreakSum into four major modules: 1. Collection Processing; 2. Local Language Processing; 3. MapReduce with Apache Hadoop; 4. Summarization.
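The duplicate-file removal step listed above can be approximated with a short script like the one below. It is a sketch based only on the description (content hashes), not the OutbreakSum code itself, and the directory path and file pattern are assumptions.

```python
# Sketch of duplicate-file removal based on content hashes (path and glob are assumed).
import hashlib
from pathlib import Path

seen = {}
for path in Path("collection/").glob("*.txt"):
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in seen:
        print(f"removing duplicate {path} (same content as {seen[digest]})")
        path.unlink()
    else:
        seen[digest] = path
```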
- Computational Linguistic Analysis of Earthquake Collections
  Bialousz, Kenneth; Kokal, Kevin; Orleans-Pobee, Kwamina; Wakeley, Christopher (2014-12)
  CS4984 is a newly offered class at Virginia Tech with a unit-based, problem/project-based learning curriculum. This class style is based on NSF-funded curriculum work in the field of digital libraries and related topics, and in this class it is used to guide a student investigation of computational linguistics. The specific problem this report addresses is the creation of a means to automatically generate a short summary of a corpus of articles about earthquakes. Such a summary should be representative of the texts and include all relevant information about the earthquakes. For our analysis, we operated on two corpora: one about a 5.8 magnitude earthquake in Virginia in August 2011, and another about a 6.6 magnitude earthquake in April 2013 in Lushan, China. Techniques used to analyze the articles include clustering, lemmatization, frequency analysis of n-grams, and regular expression searches.
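As one example of the regular expression searches mentioned above, a hypothetical pattern for pulling earthquake magnitude mentions out of articles might look like this; the pattern and sample text are illustrative assumptions, not the team's actual expressions.

```python
# Illustrative regex search for magnitude mentions (pattern and sample text are assumptions).
import re
from collections import Counter

magnitude_re = re.compile(r"\b(?:magnitude[- ](\d\.\d)|(\d\.\d)[- ]magnitude)\b", re.IGNORECASE)

sample = "A 5.8 magnitude earthquake struck Virginia; the magnitude 6.6 quake hit Lushan."
hits = [a or b for a, b in magnitude_re.findall(sample)]
print(Counter(hits).most_common())   # most frequently reported magnitudes
```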
- Exploring the Blacksburg Community Events Collection
  Antol, Stanislaw; Ayoub, Souleiman; Folgar, Carlos; Smith, Steve (2014-12)
  With the advent of new technology, especially the combination of smart phones and widespread Internet access, people are increasingly becoming absorbed in digital worlds – worlds that are not bounded by geography. As such, some people worry about what this means for local communities. The Virtual Town Square project is an effort to harness people's use of these kinds of social networks, but with a focus on local communities. As part of the Fall 2014 CS4984 Computational Linguistics course, we explored a collection of documents, the Blacksburg Events Collection, that were mined from the Virtual Town Square for the town of Blacksburg, Virginia. We describe our activities to summarize this collection to inform newcomers about the local community. We begin by describing the approach that we took, which consisted of first cleaning our dataset and then applying the idea of Hierarchical Clustering to our collection. The core idea is to cluster the documents of our collection into sub-clusters, then cluster those sub-clusters, and then finally do sub-clustering on the sentences of the final sub-clusters. We then choose the sentences closest to the final sentence sub-cluster centroids as our summaries. Some of the summary sentences capture very relevant information about specific events in the community, but our final results still have a fair bit of noise and are not very concise. We then discuss some of the lessons that we learned throughout the course of the project, such as the importance of good project planning and quickly iterating on actual solutions instead of just discussing the multitude of approaches that can be taken. We then provide suggestions to improve upon our approach, especially ways to clean up the final sentence summaries. The appendix also contains a Developer's Manual that describes the included files and the final code in detail.
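A minimal sketch of the centroid-based sentence selection idea described above, using scikit-learn TF-IDF vectors and k-means; the team's pipeline was hierarchical and more elaborate, and the sentences and cluster count here are placeholders.

```python
# Sketch: cluster sentences with k-means on TF-IDF vectors and keep the sentence
# nearest each centroid as a summary sentence (sentences and k are placeholders).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

sentences = [
    "The farmers market opens Saturday morning on Draper Road.",
    "Volunteers are needed for the downtown street fair.",
    "The town council meets Tuesday to discuss the new bike path.",
    "Live music returns to the farmers market this weekend.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

# Index of the sentence closest to each cluster centroid.
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, vectors)
for idx in closest:
    print(sentences[idx])
```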
- Generating an Intelligent Human-Readable Summary of a Shooting Event from a Large Collection of Webpages
  Chandrasekaran, Arjun; Sharma, Saurav; Sulucz, Peter; Tran, Jonathan (2014-12)
  We describe our approach to generating summaries of a shooting event from a large collection of webpages. We work with two separate events: a shooting at a school in Newtown, Connecticut and another at a mall in Tucson, Arizona. Our corpora of webpages are inherently noisy and contain a large amount of irrelevant information. In our approach, we attempt to clean up our webpage collection by removing all irrelevant content. For this, we utilize natural language processing techniques such as word frequency analysis, part-of-speech tagging, and named entity recognition to identify key words about our news events. Using these key words as features, we employ classification techniques to categorize each document as relevant or irrelevant, and we discard the documents classified as irrelevant. We observe that to generate a summary, we require specific information that enables us to answer important questions such as "Who was the killer?", "Where did the shooting happen?", "How many casualties were there?", and so on. To enable extraction of these essential details from news articles, we design a template of the event summary with slots that pertain to the information we would like to extract. We designed regular expressions to identify a number of 'candidate' values for the template slots. Using a combination of word frequency analysis and specific validation techniques, we choose the top candidate for each slot of our template. We use a grammar based on our template to generate a human-readable summary of each event. We utilize the Hadoop MapReduce framework to parallelize our workflow, along with the NLTK language processing library to simplify and speed up our development. We learned that a variety of different methods and techniques are necessary in order to provide an accurate summary for any collection. We found that cleaning is an incredibly difficult yet necessary task when attempting to semantically interpret data. Our attempts to extract relevant topics and sentences using the topic extraction method Latent Dirichlet Allocation and k-means clustering did not result in topics and sentences that were indicative of our corpus. We demonstrate an effective way of summarizing a shooting event that extracts relevant information using regular expressions and generates a comprehensive human-readable summary using a regular grammar. Our solution generates a summary that includes the key information needed to understand a shooting event: the shooter(s), date of the shooting, location of the shooting, number of people injured and wounded, and the weapon used. This solution is shown to work effectively for two different types of shootings: a mass murder and an assassination attempt.
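A sketch of the slot-filling idea described above: extract candidate values for a template slot with a regular expression and keep the most frequent candidate. The pattern, slot name, and sample documents are illustrative assumptions, not the team's actual template.

```python
# Sketch of template slot filling: regex candidates plus frequency-based selection.
# Pattern, slot, and sample documents are illustrative assumptions.
import re
from collections import Counter

docs = [
    "Police said six people were wounded in the attack.",
    "Reports indicate six people were wounded before the gunman fled.",
    "Officials initially said four people were wounded.",
]

wounded_re = re.compile(r"\b(\w+) people were wounded\b", re.IGNORECASE)

candidates = Counter()
for doc in docs:
    candidates.update(m.lower() for m in wounded_re.findall(doc))

slot_value, count = candidates.most_common(1)[0]
print(f"wounded slot -> {slot_value} (supported by {count} mentions)")
```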
- Natural Language Processing: Generating a Summary of Flood Disasters
  Acanfora, Joseph; Evangelista, Marc; Keimig, David; Su, Myron (2014-12)
  In the event of a natural disaster like a flood, news outlets rush to produce coverage for the general public. People may want a clear, concise summary of the event without having to read through hundreds of documents describing it in different ways. This report describes how to use computational techniques in Natural Language Processing (NLP) to automatically generate a summary of a flood event given a collection of diverse text documents. The body of this document covers NLP topics and techniques, utilizing the NLTK Python library and Apache Hadoop to analyze and summarize a corpus. While this document describes the usage of such tools, it does not give an in-depth explanation of how these tools work, but rather focuses on their application to generating a summary of a flood event.
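Since this report leans on Apache Hadoop for the heavy counting work, here is a generic Hadoop Streaming word-count pair in Python as a sketch of that style of processing. It is not the team's code, and the stopword list is an assumption.

```python
# mapper.py -- generic Hadoop Streaming word-count mapper (illustrative, not the team's code).
import sys

STOPWORDS = {"the", "a", "an", "and", "of", "in", "to"}  # assumed minimal stopword list

for line in sys.stdin:
    for word in line.lower().split():
        word = word.strip(".,!?\"'()")
        if word and word not in STOPWORDS:
            print(f"{word}\t1")
```

```python
# reducer.py -- sums counts per word; Hadoop Streaming sorts mapper output by key first.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current:
        total += int(count)
    else:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, int(count)
if current is not None:
    print(f"{current}\t{total}")
```

A typical invocation (all paths are placeholders) would resemble: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /flood_collection -output /word_counts.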
- Summarizing Fire Events with Natural Language Processing
  Plahn, Jordan; Zamani, Michael; Lee, Hayden; Trujillo, Michael (2014-12)
  Throughout this semester, we were driven by one question: how do we best summarize a fire with articles scraped from the Internet? We took a variety of approaches to answer it, incrementally constructing a solution to summarize our events in a satisfactory manner. We needed a considerable amount of data to process. This data came in the form of two separate corpora: one involving the Bastrop County, Texas wildfires of 2011 and the other the Kiss nightclub fire of 2013 in Santa Maria, Brazil. For our "small" collection, the Texas wildfires, we had approximately 16,000 text files. For our "large" collection, the nightclub fire, we had approximately 690,000 text files. In theory, each text file contained a single news article relating to the event; in reality, this was rarely true. As a result, we had to perform considerable preprocessing of our corpora to ensure useful outcomes. The incremental steps to produce our final summary took the form of 9 units completed over the course of the semester, with each building on the work of the previous unit. Owing to our lack of domain knowledge at the beginning of the semester (with either fires or natural language processing), we were provided considerable guidance to produce naive, albeit useful, initial solutions. In the first few units, we summarized our collections with brute-force approaches: choosing the most frequent words as descriptors, manually generating words to describe the collection, selecting descriptive lemmas, and more. Most of these approaches are characterized by arbitrarily selecting descriptors based on frequency alone, with little consideration for the underlying linguistic significance. From this, we transitioned to more intelligent approaches, attempting to utilize more fine-grained techniques to remove extraneous information. We incorporated part-of-speech (POS) tagging to determine the speech type of each word, which allows us to select the most important nouns, for example. Using POS tagging, as well as an ever-expanding stopword list, allowed us to remove many of the uninformative results. To further improve our collection, we needed a way to filter out more than just stopwords. In our case, we had many text files that were unrelated to the corpus topics, which could corrupt or skew our results. To address this, we built a document classifier to determine whether articles are relevant and mark them appropriately, allowing us to include only the relevant articles in our processing. Despite this, our collection still suffered from considerable noise. In almost all of our units we employed various "big data" techniques and tools, including MapReduce and Mahout. These tools allowed us to process extremely large collections of data in an efficient manner. With these tools we could select the most relevant names, topics, and sentences, providing the framework for a summary of the entire collection. These insights led us to the final two units of producing a summarization based on preconstructed templates of our events. Using a mixture of every technique we had learned, we constructed paragraphs that summarized both fires in our collections. For the final two units of our course, we were tasked with creating a paragraph summary of both the Texas wildfire and the Brazil nightclub fire events.
  We began with a generic fire event template with a set of attributes to be filled in with the best results we could extract. We decided early on to create separate templates for the more specific fire event types of wildfires and building fires, as some details do not overlap between the two event types. To fill in our templates, we created a process of extracting, refining, and finally filling in our gathered results. To extract data from our corpora, we created a regular expression for each attribute type and stored any matches found. Next, using only the top 10 results for each attribute, we filtered results by part of speech, constructed a simple grammar to modify the template according to our selected result, and conjugated any present-tense verbs to past tense.
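The part-of-speech filtering step described above can be sketched with NLTK as follows: tag the tokens and keep only the most frequent nouns. The sample text is a placeholder for a document from the fire collections.

```python
# Sketch of POS-based filtering: keep only nouns, then rank them by frequency.
# The sample text is a placeholder for a document from the fire collections.
import nltk
from collections import Counter

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Wildfires burned thousands of acres in Bastrop County as firefighters battled the flames."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)

nouns = [word.lower() for word, tag in tagged if tag.startswith("NN")]
print(Counter(nouns).most_common(10))
```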
- Big Data Text Summarization: Using Deep Learning to Summarize Theses and Dissertations
  Ahuja, Naman; Bansal, Ritesh; Ingram, William A.; Jude, Palakh; Kahu, Sampanna; Wang, Xinyue (Virginia Tech, 2018-12-05)
  Team 16 in the fall 2018 course "CS 4984/5984 Big Data Text Summarization," in partnership with the University Libraries and the Digital Library Research Laboratory, prepared a corpus of electronic theses and dissertations (ETDs) for students to study natural language processing with the power of state-of-the-art deep learning technology. The ETD corpus is made up of 13,071 doctoral dissertations and 17,890 master's theses downloaded from the University Libraries' VTechWorks system. This study is designed to explore big data summarization for ETDs, a relatively under-explored area. The results of the project will shed light on the difficulty of information extraction from ETD documents, the potential of transfer learning for automatic summarization of ETD chapters, and the quality of state-of-the-art deep learning summarization technologies when applied to the ETD corpus. The goal of this project is to generate chapter-level abstractive summaries for an ETD collection through deep learning. Major challenges of the project include accurately extracting well-formatted chapter text from PDF files and the lack of labeled data for supervised deep learning models. For PDF processing, we compare two state-of-the-art scholarly PDF data extraction tools, Grobid and Science-Parse, which generate structured documents from which we can further extract metadata and chapter-level text. For the second challenge, we perform transfer learning by training supervised learning models on a labeled dataset of Wikipedia articles related to the ETD collection. Our experimental models include Sequence-to-Sequence and Pointer Generator summarization models. Besides supervised models, we also experiment with an unsupervised reinforcement model, Fast Abstractive Summarization-RL. The general pipeline for our experiments consists of the following steps: PDF data processing and chapter extraction, collecting a training data set of Wikipedia articles, manually creating human-generated gold standard summaries for testing and validation, building deep learning models for chapter summarization, evaluating and tuning the models based on results, and then iteratively refining the whole process.
- Automatic Summarization of News Articles about Hurricane Florence
  Wanye, Frank; Ganguli, Samit; Tuckman, Matt; Zhang, Joy; Zhang, Fangzheng (Virginia Tech, 2018-12-07)
  We present our approach for generating automatic summaries from a collection of news articles acquired from the World Wide Web relating to Hurricane Florence. Our approach consists of 10 distinct steps, at the end of which we produce three separate summaries using three distinct methods:
  1. A template summary, in which we extract information from the web page collection to fill in blanks in a template.
  2. An extractive summary, in which we extract the most important sentences from the web pages in the collection.
  3. An abstractive summary, in which we use deep learning techniques to rephrase the contents of the web pages in the collection.
  The first six steps of our approach involve extracting important words, synsets, words constrained by part of speech, a set of discriminating features, important named entities, and important topics from the collection. This information is then used by the algorithms that generate the automatic summaries. To produce the template summary, we employed a modified version of the hurricane summary template provided to us by the instructor. For each blank space in the modified template, we used regular expression matching with selected keywords to filter out relevant sentences from the collection, and then a combination of regex matching and entity tagging to select the relevant information for filling in the blanks. Most values also required unit conversion to capture all values from the articles, not just values of a specific unit. Numerical analysis was then performed on these values to get either the mode or the mean, and for some values, such as rainfall, the standard deviation was then used to estimate the maximum. To produce the extractive summary, we employed existing extractive summarization libraries. In order to synthesize information from multiple articles, we used an iterative approach, concatenating generated summaries and then summarizing the concatenated summaries. To produce the abstractive summary, we employed existing deep learning summarization techniques. In particular, we used a pre-trained Pointer-Generator neural network model. As with the extractive summary, we clustered the web pages in the collection by topic before running them through the neural network model, to reduce the amount of repeated information produced. Out of the three summaries that we generated, the template summary is the best overall due to its coherence. The abstractive and extractive summaries both provide a fair amount of information, but are severely lacking in organization and readability. Additionally, they include specific details that are irrelevant to the hurricane. All three of the summaries could be improved with further data cleaning, and the template summary could easily be extended to include more information about the event so that it would be more complete.
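A small sketch of the numerical slot-filling described above: normalize extracted rainfall values to a single unit, then use the mean and standard deviation to estimate a maximum. The values, units, and the "mean plus two standard deviations" rule are illustrative assumptions.

```python
# Sketch of the numerical analysis for template slots: unit conversion plus simple statistics.
# The extracted (value, unit) pairs and the max-estimation rule are illustrative assumptions.
from statistics import mean, mode, stdev

extracted_rainfall = [(20, "in"), (508, "mm"), (18, "in"), (22, "in"), (559, "mm")]

def to_inches(value, unit):
    return value / 25.4 if unit == "mm" else value

inches = [to_inches(v, u) for v, u in extracted_rainfall]

print("mode (most reported):", mode(round(x) for x in inches))
print("mean:", round(mean(inches), 1))
# Estimate the maximum as mean + 2 standard deviations (an assumed rule of thumb).
print("estimated max:", round(mean(inches) + 2 * stdev(inches), 1))
```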
- Big Data Text Summarization for the NeverAgain Movement
  Arora, Anuj; Miller, Chreston; Fan, Jixiang; Liu, Shuai; Han, Yi (Virginia Tech, 2018-12-10)
  When browsing social media websites such as Twitter and Facebook, have you ever seen hashtags like #NeverAgain and #EnoughIsEnough? Do you know what they mean? Never Again is an American student-led political movement for gun control to prevent gun violence. In the United States, gun control has long been debated. According to data from the Gun Violence Archive (http://www.shootingtracker.com/), the U.S. saw a total of 346 mass shootings in 2017. Supporters claim that the proliferation of firearms directly sparks social unrest such as robbery, sexual crimes, and theft, while others believe that gun culture represents an integral part of their freedom. For the Never Again gun control movement, we would like to generate a human-readable summary based on deep learning methods, so that one can study incidents of gun violence that shocked the world, such as the 2017 Las Vegas shooting, in order to understand the impact of gun proliferation. Our project includes three steps: pre-processing, topic modeling, and abstractive summarization using deep learning. We began with a large collection of news articles associated with the #NeverAgain movement. The raw news articles needed to be pre-processed in multiple ways. An ArchiveSpark script was used to convert the WARC and CDX files to readable, parseable JSON. However, we found that at least forty percent of the data was noise, so a series of restrictive word filters was applied to remove it. After noise removal, we identified the most frequent words to get a preliminary idea of whether we were filtering noise properly. We used the Natural Language Toolkit's (NLTK) Named Entity chunker to generate named entities, which are phrases that form important nouns (people, places, organizations, etc.) in a sentence. For topic modeling, we classified sentences into different buckets, or topics, which identified distinct themes in the collection. The Latent Dirichlet Allocation algorithm used for topic modeling does not take the normalized and tokenized word corpus directly; during dictionary creation and document vectorization, each article in the collection had to be converted into a vector. We chose the Bag of Words (BOW) approach. The Bag of Words method is a simplifying representation used in natural language processing and information retrieval: text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order, but keeping multiplicity. Topic modeling also requires choosing the number of topics, which means one must guess how many topics are present in a collection. There is no foolproof way of replacing human logic to weave keywords into topics with semantic meaning, so we tried the coherence score approach. The coherence score attempts to mimic the human readability of the topics: the higher the coherence score, the more "coherent" the topics are considered. The last step for topic modeling is Latent Dirichlet Allocation (LDA), a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
  Compared with some other algorithms, LDA is probabilistic, which means that it is better at handling topic mixtures in different documents. In addition, LDA identifies topics coherently, whereas the topics from other algorithms are more disjoint. After we had our topics (three in total), we filtered the article collection based on these topics. What resulted was three distinct collections of articles to which we could apply an abstractive summarization algorithm to produce a coherent summary. We chose a Pointer-Generator Network (PGN), a deep learning approach designed to create abstractive summaries, to produce said summaries. We created a summary for each identified topic and performed post-processing to connect the three (related) topic summaries into one summary that flowed. The result was a summary that reflected the main themes of the article collection and informed the reader of its contents in less than two pages.
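A minimal gensim sketch of the Bag-of-Words, LDA, and coherence-score workflow described above; the toy documents and candidate topic counts are placeholders, not the project's data or settings.

```python
# Sketch of the Bag-of-Words / LDA / coherence workflow with gensim.
# The toy documents and candidate topic counts are placeholders.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

docs = [
    ["students", "march", "gun", "control", "rally"],
    ["senate", "debate", "gun", "legislation", "vote"],
    ["school", "shooting", "vigil", "students", "memorial"],
    ["march", "rally", "protest", "legislation", "vote"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]   # Bag-of-Words vectors

best = None
for k in (2, 3, 4):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0, passes=10)
    coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    print(f"{k} topics -> coherence {coherence:.3f}")
    if best is None or coherence > best[0]:
        best = (coherence, lda)

for topic_id, words in best[1].print_topics(num_words=5):
    print(topic_id, words)
```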
- Big Data Text Summarization - Hurricane Harvey
  Geissinger, Jack; Long, Theo; Jung, James; Parent, Jordan; Rizzo, Robert (Virginia Tech, 2018-12-12)
  Natural language processing (NLP) has advanced in recent years. Accordingly, we present progressively more complex generated text summaries on the topic of Hurricane Harvey. We utilized TextRank, an unsupervised extractive summarization algorithm. TextRank is computationally expensive, and the sentences selected by the algorithm aren't always directly related or essential to the topic at hand. When evaluating TextRank, we found that a single interjected sentence ruined the flow of the summary. We also found that the ROUGE evaluation for our TextRank summary was quite low compared to a gold standard that was prepared for us. However, the TextRank summary had high marks for ROUGE evaluation compared to the Wikipedia article lead for Hurricane Harvey. To improve upon the TextRank algorithm, we utilized template summarization with named entities. Template summarization takes less time to run than TextRank but is supervised by the author of the template and script to choose valuable named entities. Thus, it is highly dependent on human intervention to produce reasonable and readable summaries that aren't error-prone. As expected, the template summary evaluated well compared to the Gold Standard and the Wikipedia article lead. This result is mainly due to our ability to include named entities we thought were pertinent to the summary. Beyond extractive summaries like TextRank and template summarization, we pursued abstractive summarization using pointer-generator networks and multi-document summarization with pointer-generator networks and maximal marginal relevance. The benefit of abstractive summarization is that it is more in line with how humans summarize documents. Pointer-generator networks, however, require GPUs to run properly and a large amount of training data. Luckily, we were able to use a pre-trained network to generate summaries. The pointer-generator network is the centerpiece of our abstractive methods and allowed us to create summaries in the first place. NLP is at an inflection point due to deep learning, and our generated summaries using a state-of-the-art pointer-generator neural network are filled with details about Hurricane Harvey, including damage incurred, the average amount of rainfall, and the locations it affected the most. The summary is also free of grammatical errors. We also used a novel Python library, written by Logan Lebanoff at the University of Central Florida, for multi-document summarization using deep learning to summarize our Hurricane Harvey dataset of 500 articles and the Wikipedia article for Hurricane Harvey. The summary of the Wikipedia article is our final summary and has the highest ROUGE scores that we could attain.
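A compact sketch of the TextRank idea used above: build a sentence-similarity graph from TF-IDF vectors and rank sentences with PageRank. This uses scikit-learn and networkx rather than whatever implementation the team used, and the sentences are placeholders.

```python
# Sketch of TextRank-style extractive summarization: rank sentences by PageRank
# over a TF-IDF cosine-similarity graph. Sentences are placeholders; the team's
# actual TextRank implementation may differ.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Hurricane Harvey made landfall in Texas as a Category 4 storm.",
    "Record rainfall caused catastrophic flooding across Houston.",
    "Thousands of residents were rescued from flooded homes.",
    "Damage estimates reached tens of billions of dollars.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
similarity = cosine_similarity(tfidf)

graph = nx.from_numpy_array(similarity)
scores = nx.pagerank(graph)

# Keep the two highest-scoring sentences, in their original order.
top = sorted(sorted(scores, key=scores.get, reverse=True)[:2])
print(" ".join(sentences[i] for i in top))
```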
- Summarization of Maryland Shooting Collection
  Khawas, Prapti; Banerjee, Bipasha; Zhao, Shuqi; Fan, Yiyang; Kim, Yoonjin (Virginia Tech, 2018-12-12)
  The goal of this work is to generate summaries of two Maryland shooting events from a large collection of web pages related to a shooting at Great Mills High School and another at the Capital Gazette newsroom. Since our team did not have prior experience with Computational Linguistics / Natural Language Processing (NLP), we followed an approach in which we built summaries using 10 different methods, as suggested by course instructor Dr. Edward Fox, with each method more sophisticated than the previous ones, to enable learning of key concepts in NLP. First, we started with finding a set of the most frequent important words. Then, we found other words occurring in the articles that mean the same as the frequent words, along with sets of hypernyms and hyponyms. We identified a set of words constrained by POS, e.g., nouns and verbs. We then tried out various classification techniques in Apache Mahout to classify the documents into the two different events and eliminate irrelevant documents. Next, we identified a set of frequent and important named entities using the NLTK and SpaCy Named Entity Recognition (NER) modules. We identified a set of important topics using Latent Dirichlet Allocation (LDA), and we generated clusters of documents using K-means. Next, we extracted a set of values for each slot matching the collection semantics using regular expressions and generated a readable summary explaining the slots and values using a Context Free Grammar we developed. Finally, we used the Pointer Generator deep learning approach to generate a readable abstractive summary. Using the above approach, we generated two extractive summaries, for the newsroom shooting event and the school shooting event, with ROUGE-1 scores around 0.33 and 0.26, respectively. For the abstractive summaries that we generated, the ROUGE-1 score was 0.36 for the newsroom shooting event and 0.20 for the school shooting event. We also evaluated the summaries at the sentence level and found that the abstractive school shooting summary had a higher ROUGE-1 score (0.88) than the abstractive newsroom shooting summary (0.73). We employed the Hadoop MapReduce framework to speed up the processing time for our large collection. We used various other tools, like the NLTK language processing library and Apache Mahout, a distributed linear algebra framework, to simplify our development. We learned that a variety of methods and techniques suited to the collection are necessary in order to provide an accurate summary. We also learned the importance of cleaning the collection and the challenges in the task.
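A small NLTK sketch of the WordNet step described above: collecting synonyms, hypernyms, and hyponyms for a frequent word. The example word is a placeholder.

```python
# Sketch of the WordNet expansion step: synonyms, hypernyms, and hyponyms for a word.
# The example word "shooting" is a placeholder for a frequent collection word.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

word = "shooting"
synonyms, hypernyms, hyponyms = set(), set(), set()

for synset in wn.synsets(word):
    synonyms.update(lemma.name() for lemma in synset.lemmas())
    hypernyms.update(l.name() for h in synset.hypernyms() for l in h.lemmas())
    hyponyms.update(l.name() for h in synset.hyponyms() for l in h.lemmas())

print("synonyms:", sorted(synonyms))
print("hypernyms:", sorted(hypernyms))
print("hyponyms:", sorted(hyponyms))
```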
- Abstractive Text Summarization of the Parkland Shooting Collection
  Kingery, Ryan; Yellapantula, Sudha Ravali; Xu, Chao; Huang, Li Jun; Ye, Jiacheng (Virginia Tech, 2018-12-12)
  We analyze various ways to perform abstractive text summarization on an entire collection of news articles. We specifically seek to summarize the collection of web-archived news articles relating to the 2018 shooting at Marjory Stoneman Douglas High School in Parkland, Florida. The original collection contains about 10,100 archived web pages that mostly relate to the shooting, which after pre-processing reduces to about 3,900 articles that directly relate to the shooting. We then explore several ways to generate abstractive summaries for the collection using deep learning methods. Since current deep learning methods for abstractive summarization are only capable of summarizing text at the single-article level or below, to perform summarization on our collection we identify a set of representative articles from the collection, summarize each of those articles using our deep learning models, and then concatenate those summaries together to produce a summary for the entire collection. To identify the representative articles to summarize, we investigate various unsupervised methods to partition the space of articles into meaningful groups. We try choosing these articles by random sampling from the collection, by using topic modeling, and by sampling from clusters obtained from clustering on Doc2Vec embeddings. To summarize each individual article, we explore various state-of-the-art deep learning methods for abstractive summarization: a sequence-to-sequence model, a pointer generator network, and a reinforced extractor-abstractor network. To evaluate the quality of our summaries we employed two methods. The first was a subjective method, where each person ranked the quality of each summary. The second was an objective method that used various ROUGE metrics to compare each summary to an independently generated gold standard summary. We found that most ROUGE scores were quite low overall, with only the pointer-generator network on randomly sampled articles achieving a ROUGE score above 0.15. This suggests that such deep learning techniques still have a lot of room for improvement if they are to be viable for collection summarization.
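A sketch of the Doc2Vec clustering route described above for choosing representative articles; the toy documents, vector size, and cluster count are placeholders, not the team's settings.

```python
# Sketch: embed articles with Doc2Vec, cluster the embeddings with k-means, and
# pick the article nearest each centroid as a representative to summarize.
# Toy documents, vector size, and cluster count are placeholders.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

articles = [
    "students organized a nationwide walkout after the shooting",
    "lawmakers debated new gun control legislation in florida",
    "the school reopened two weeks after the tragedy",
    "survivors spoke at rallies across the country",
]

tagged = [TaggedDocument(words=a.split(), tags=[i]) for i, a in enumerate(articles)]
model = Doc2Vec(tagged, vector_size=32, min_count=1, epochs=50)

vectors = np.array([model.dv[i] for i in range(len(articles))])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, vectors)
for idx in closest:
    print("representative article:", articles[idx])
```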
- Generating Text Summaries for the Facebook Data Breach with Prototyping on the 2017 Solar Eclipse
  Hamilton, Leah; Robb, Esther; Fitzpatrick, April; Goel, Akshay; Nandigam, Ramya (Virginia Tech, 2018-12-13)
  Summarization is often a time-consuming task for humans. Automated methods can summarize a larger volume of source material in a shorter amount of time, but creating a good summary with these methods remains challenging. This submission contains all work related to a semester-long project in CS 4984/5984 to generate the best possible summary of a collection of 10,829 web pages about the Facebook-Cambridge Analytica data breach, with some early prototyping done on 500 web pages about the 2017 Solar Eclipse. A final report, a final presentation, and several archives of code, input data, and results are included. The work implements basic natural language processing techniques such as word frequency, lemmatization, and part-of-speech tagging, working up to a complete human-readable summary at the end of the course. Extractive, abstractive, and combination methods were used to generate the final summaries, all of which are included and the results compared. The summary subjectively evaluated as best was a purely extractive summary built from concatenating summaries of document categories. This method was coherent and thorough, but involved manual tuning to select categories and still had some redundancy. All attempted methods are described and the less successful summaries are also included. This report presents a framework for how to summarize complex document collections with multiple relevant topics. The summary itself identifies information which was most covered about the Facebook-Cambridge Analytica data breach and is a reasonable introduction to the topic.
- Big Data Text Summarization - Hurricane Irma
  Chava, Raja Venkata Satya Phanindra; Dhar, Siddharth; Gaur, Yamini; Rambhakta, Pranavi; Shetty, Sourabh (Virginia Tech, 2018-12-13)
  With the increased rate of content generation on the Internet, there is a pressing need for tools that automate the extraction of meaningful data. Big data analytics deals with researching patterns or implicit correlations within a large collection of data. There are several sources to get data from, such as news websites, social media platforms (for example, Facebook and Twitter), sensors, and other IoT (Internet of Things) devices. Social media platforms like Twitter prove to be important sources of data since the level of activity increases significantly during major events such as hurricanes, floods, and events of global importance. To generate summaries, we first had to convert the WARC file we were given into JSON format, which was easier to work with. We then cleaned the text by removing boilerplate and redundant information. After that, we proceeded with removing stopwords and collecting the most important words occurring in the documents. This ensured that the resulting summary would have important information from our corpus and would still be able to answer all the questions. One of the challenges we faced at this point was deciding how to correlate words in order to get the most relevant words out of a document; we tried several techniques, such as TF-IDF, to resolve this. Correlation of different words with each other is an important factor in generating a cohesive summary, because while a word may not be in the list of most commonly occurring words in the corpus, it could still be relevant and give significant information about the event. Because Hurricane Irma occurred around the same time as Hurricane Harvey, a large number of documents were not about Hurricane Irma; all such documents were eliminated as non-relevant. Classification of documents as relevant or non-relevant ensured that our deep learning summaries were not generated on data that was not crucial to building our final summary. Initially, we attempted to use Mahout classifiers, but the results obtained were not satisfactory. Instead, we used a much simpler word filtering approach for classification, which eliminated a significant number of documents by classifying them as non-relevant. We used the Pointer-Generator technique, which implements a Recurrent Neural Network (RNN), to build the deep learning abstractive summary. We combined data from multiple relevant documents into a single document, and thus generated multiple summaries, each corresponding to a set of documents. We wrote a Python script to perform post-processing on the generated summary, converting all alphabetic characters after a period and space to uppercase. This was important because, for lemmatization, stopword removal, and POS tagging, the whole dataset is converted to lowercase. The script also converts the first alphabetic character of all POS-tagged proper nouns to uppercase. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to evaluate the generated summary against the gold standard summary. The abstractive summary returns good evaluation results when compared with the Gold Standard on the ROUGE_sent evaluation. The ROUGE_para and cov_entity evaluation results were not up to the mark, but we feel that was mainly due to the writing style of the Gold Standard, as our abstractive summary was able to provide most of the information related to Hurricane Irma.
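The post-processing step described above can be sketched as a short script like this; the proper-noun list would come from the project's POS tagging step, and here it is a placeholder.

```python
# Sketch of the summary post-processing: capitalize letters after sentence-ending
# periods and restore capitalization of POS-tagged proper nouns. The proper-noun
# list is a placeholder; in the project it came from the POS tagging step.
import re

PROPER_NOUNS = {"irma", "florida", "caribbean"}   # assumed output of the POS tagger

def postprocess(summary: str) -> str:
    # Uppercase the first letter of the text and any letter following ". ".
    summary = re.sub(r"(^|\. )([a-z])", lambda m: m.group(1) + m.group(2).upper(), summary)
    # Restore capitalization of known proper nouns.
    for noun in PROPER_NOUNS:
        summary = re.sub(rf"\b{noun}\b", noun.capitalize(), summary)
    return summary

print(postprocess("hurricane irma struck florida. the caribbean saw severe damage."))
```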
- Hurricane Matthew Summarization
  Goldsworthy, Michael; Tran, Thoang; Asif, Areeb; Gregos, Brendan (Virginia Tech, 2018-12-14)
  The report, presentation, and code for our project for the course CS 4984/5984: Big Data Text Summarization are included in this submission. Our team had to explore methods of text summarization for two datasets and report on our findings. The report covers our methods, starting with information on cleaning the data and filtering unnecessary documents. It then describes simple tasks such as counting the most common and important words and counting words by their part of speech. Following this, the report focuses on intermediate tasks such as clustering and finding LDA topics. Finally, it presents our best methods for summarization, i.e., template and extractive summarization. We describe the algorithms, motivations, and conclusions we drew from each of our attempts. The report also contains a user and developer guide for using and maintaining our code, as well as a description of the tools and libraries we used. At the end there is also the Gold Standard Summary that we manually generated for another team in the course, to be used as a comparison for their automatically generated summary. We evaluated our automatically generated summary against a gold standard prepared by Team 2, and found that our extractive summary performed the best based on its ROUGE scores. The source code zip file contains the code used for the tasks described in the report. The code was written in Python and can be run only after installing the dependencies listed in the User Manual section of the report. The presentation file has the slides from the final presentation, containing much of the information in the report in a greatly simplified form. An editable version of the LaTeX document used to create our final report, and the editable PPTX file from our final presentation, are also included.
- Hybrid Summarization of Dakota Access Pipeline Protests (NoDAPL)
  Chen, Xiaoyu; Wang, Haitao; Mehrotra, Maanav; Chhikara, Naman; Sun, Di (Virginia Tech, 2018-12-14)
  The Dakota Access Pipeline Protests (known by the hashtag #NoDAPL) are grassroots movements that began in April 2016 in reaction to the approved construction of Energy Transfer Partners' Dakota Access Pipeline in the northern United States. The NoDAPL movements produce many Facebook messages, tweets, blogs, and news articles, which reflect different aspects of the NoDAPL events. The related information keeps increasing rapidly, which makes it difficult to understand the events in an efficient manner. Therefore, it is invaluable to automatically, or at least semi-automatically, generate short summaries from the big text data available online. Motivated by this need, the objective of this project is to propose a novel automatic summarization approach to efficiently and effectively summarize the topics hidden in large online text collections. Although automatic summarization has been investigated for more than 60 years, since the publication of Luhn's 1958 seminal paper, several challenges exist in summarizing online big text sets, such as a large proportion of noisy text, highly redundant information, and multiple latent topics. Therefore, we propose an automatic framework requiring minimal human effort to summarize big online text sets (~11,000 documents on NoDAPL) according to latent topics, with non-relevant information removed. This framework provides a hybrid model that combines the advantages of latent Dirichlet allocation (LDA) based extractive methods and deep-learning based abstractive methods. Unlike semi-automatic summarization approaches such as template-based summarization, the proposed method does not require practitioners to have a deep understanding of the events in order to create a template, nor to fill in the template using regular expressions. The only human effort needed is to manually label a few (say, 100) documents as relevant or irrelevant. We evaluate the quality of the generated summary with both extrinsic and intrinsic measurements. In the extrinsic subjective evaluation, we designed a set of guideline questions and conducted a task-based measurement. Results show that 91.3% of sentences are within the scope of the guideline and 69.6% of the outlined questions can be answered by reading the generated summary. The intrinsic ROUGE measurements show a total entity coverage of 2.6%, with ROUGE-L and ROUGE-SU4 scores of 0.148 and 0.065, respectively. Overall, the proposed hybrid model achieves decent performance on summarizing the NoDAPL events. Future work includes testing the approach on more textual datasets for interesting topics, and investigating a topic modeling-supervised classification approach to minimize human effort in automatic summarization. We would also like to investigate a deep learning-based recommender system for better sentence re-ranking.
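The "label a few documents" step described above could look roughly like the following sketch: a TF-IDF plus logistic-regression relevance filter trained on a small hand-labeled sample. The documents, labels, and classifier choice are illustrative assumptions; the project's actual model may differ.

```python
# Sketch of a relevance filter trained on a small hand-labeled sample.
# Documents, labels, and the logistic-regression choice are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_docs = [
    "protesters gathered at standing rock against the dakota access pipeline",
    "the army corps reviewed the pipeline easement under lake oahe",
    "top ten slow cooker recipes for the holidays",
    "celebrity gossip roundup for the week",
]
labels = [1, 1, 0, 0]   # 1 = relevant to NoDAPL, 0 = irrelevant

clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(labeled_docs, labels)

unlabeled = ["tribal leaders spoke at the pipeline protest camp",
             "five easy weeknight dinner ideas"]
print(clf.predict(unlabeled))   # keep only documents predicted as relevant
```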
- Big Data Text Summarization - Attack Westminster
  Gallagher, Colm; Dyer, Jamie; Liebold, Jeanine; Becker, Aaron; Yang, Limin (Virginia Tech, 2018-12-14)
  Automatic text summarization is the process of using software to distill the most important information from a text document into an abridged summary. In this task, we can regard summarization as a function that takes a single document or multiple documents as input and produces a summary as output. There are two main ways to create a summary: extractive and abstractive. Extractive summarization selects the most relevant sentences from the input and concatenates them to form a summary. Graph-based algorithms like TextRank, feature-based models like TextTeaser, topic-based models like Latent Semantic Analysis (LSA), and grammar-based models can all be viewed as approaches to extractive summarization. Abstractive summarization aims to create a summary the way a human would: it keeps the original intent but uses new phrases and words not found in the original text. One of the most commonly used models is the encoder-decoder model, a neural network model mainly used in machine translation tasks. Recently, combination approaches that join extractive and abstractive summarization have also appeared, like the Pointer-Generator Network and the Extract-then-Abstract model. In this course, we were given both a small dataset (about 500 documents) and a big dataset (about 11,300 documents) that mainly consist of web archives about a specific event. Our group focused on reports about a terrorist event, Attack Westminster, which occurred outside the Palace of Westminster in London on March 22, 2017. The attacker, 52-year-old Briton Khalid Masood, drove a car into pedestrians on the pavement, injuring more than 50 people, 5 of them fatally. The attack was treated as "Islamist-related terrorism". We first created a Solr index for both the small and the big dataset, which helped us perform various queries to learn more about the data. Additionally, the index aided another team in creating a gold standard summary of our dataset for us. We then gradually delved into different concepts and topics in text summarization and natural language processing. Specifically, we utilized the NLTK library and the spaCy package to create a set of the most frequent important words, WordNet synsets that cover those words, words constrained by part of speech (POS), and frequent and important named entities. We also applied the LSA model to retrieve the most important topics. By clustering the dataset with k-means and selecting important sentences from the clusters using an implementation of the TextRank algorithm, we were able to generate a multi-paragraph summary. With the help of named entity recognition and pattern-based matching, we confidently extracted information like the name of the attacker, the date, the location, nearby landmarks, the number killed, the number injured, and the type of the attack. We then drafted a template of a readable summary to fill in the slots and values. Each of these results individually formed a summary that captures the most important information about the Westminster attack. The most successful results were obtained using the extractive summarization method (k-means clustering and TextRank), the slot-value method (named entity recognition and pattern-based matching), and the abstractive summarization method (deep learning).
  We evaluated each of the summaries using a combination of ROUGE metrics as well as named-entity coverage compared to the gold standard summary created by Team 3. Overall, the best summary was obtained using the extractive summarization method, with both ROUGE metrics and named-entity coverage outperforming the other methods.
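A compact spaCy sketch of the named entity recognition plus pattern-based matching used for slot extraction above; the sentence and the regex are placeholders, not the team's actual rules.

```python
# Sketch of slot extraction with spaCy NER plus a simple pattern for casualty counts.
# The sentence and the regex are placeholders for the team's actual rules.
# Requires: python -m spacy download en_core_web_sm
import re
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Khalid Masood drove a car into pedestrians outside the Palace of Westminster "
        "in London on March 22, 2017, injuring more than 50 people.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, "->", ent.text)     # PERSON, GPE/FAC, DATE, CARDINAL, ...

injured = re.search(r"injuring (?:more than )?(\d+) people", text)
if injured:
    print("injured slot ->", injured.group(1))
```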
- Big Data: New Zealand Earthquakes Summary
  Bochel, Alexander; Edmisten, William; Lee, Jun; Chandalura, Rohit (Virginia Tech, 2018-12-14)
  The purpose of this Big Data project was to create a computer-generated text summary of a major earthquake event in New Zealand. The summary was to be created from a large webpage dataset supplied to our team, containing 280MB of data. Our team used basic and advanced machine learning techniques to create the computer-generated summary. The research behind finding an optimal way to create such summaries is important because it allows us to analyze large sets of textual information and to identify the most important parts. It takes a human a long time to write an accurate summary, and doing so may even be impossible given the number of documents in our dataset. The use of computers to do this automatically drastically increases the rate at which important information can be extracted from a set of data. The process our team followed to achieve our results is as follows. First, we extracted the most frequently appearing words in our dataset. Our second step was to examine these words and to tag them with their part of speech. The next step was to find and examine the most frequent named entities. Our team then improved our set of important words through TF-IDF vectorization, and the prior steps were repeated with the improved set of words. Next, our team focused on creating an extractive summary. Once we completed this step, we used templating to create our final summary. Our team had many interesting findings throughout this process. We learned how to effectively use Zeppelin notebooks as a tool for prototyping code. We discovered an efficient way to process our large datasets using the Hadoop cluster along with PySpark. We discovered how to effectively clean our dataset prior to running our programs on it. We also discovered how to create the templated summary using our important named entities. Our final result was achieved using the templating method together with abstractive summarization, and it included a successful generation of a summary using the templating system. This result was readable and accurate with respect to the dataset that we were given. We also achieved decent results from the extractive summary technique. These techniques provided mostly readable summaries but still included some noise. Since our templated summary was very specific, it is the most coherent and contains only relevant information.
- CS4984/CS5984: Big Data Text Summarization Team 10 ETDs
  Baghudana, Ashish; Li, Guangchen; Liu, Beichen; Lasky, Stephen (Virginia Tech, 2018-12-14)
  Automatic text summarization is the task of creating accurate and succinct summaries of text documents. These documents can vary from newspaper articles to more academic content such as theses and dissertations. The two domains differ significantly in sentence structure and vocabulary, as well as in the length of the documents, with theses and dissertations being more verbose and using a very specialized vocabulary. Summarization techniques are broadly classified into extractive and abstractive styles: in the former, salient sentences are extracted from the text without any modification, while in the latter, sentences are modified and paraphrased. Recent developments in neural networks, language modeling, and machine translation have spurred research into abstractive text summarization. Recently developed models are generally trained on news articles, specifically CNN and DailyMail, both of which have summaries readily available through public datasets. In this project, we apply recent deep learning techniques for text summarization to produce summaries of electronic theses and dissertations from VTechWorks, Virginia Tech's online repository of scholarly work. We overcome the challenge posed by different vocabularies by creating a dataset of pre-print articles from ArXiv and training summarization models on these documents. The ArXiv collection consists of approximately 4,500 articles, each of which has an abstract and the corresponding full text. For the purposes of training summarization models, we consider the abstract to be the summary of the document. We split this dataset into train, test, and validation sets of 3,155, 707, and 680 documents, respectively. We also prepare gold standard summaries from chapters of electronic theses and dissertations. Subsequently, we train pointer generator networks on the ArXiv dataset and evaluate the trained models using ROUGE scores. The ROUGE scores are reported on both the test split of the ArXiv dataset and the gold standard summaries. While the ROUGE scores do not indicate state-of-the-art performance, we do not find any equivalent work in summarization of academic content to compare against.
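One way to compute ROUGE scores like those mentioned above is with the rouge-score package, as in this sketch; the package choice and the texts are assumptions, and the team may have used a different ROUGE implementation.

```python
# Sketch of ROUGE evaluation against a gold standard summary using the
# rouge-score package (pip install rouge-score). The texts are placeholders and
# the package choice is an assumption; another ROUGE implementation may have been used.
from rouge_score import rouge_scorer

gold = "The chapter surveys neural abstractive summarization methods for scholarly text."
generated = "This chapter reviews neural methods for abstractive summarization of scholarly documents."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(gold, generated).items():
    print(f"{name}: precision={score.precision:.2f} recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```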