CS4984: Special Topics
The title of the CS4984 Special Topics class changes from year to year, for example Computational Linguistics (2014) and Big Data Text Summarization (2018). The course also includes a graduate section, CS5984.
Browsing CS4984: Special Topics by Subject "big data"
- Automatic Summarization of News Articles about Hurricane Florence
  Wanye, Frank; Ganguli, Samit; Tuckman, Matt; Zhang, Joy; Zhang, Fangzheng (Virginia Tech, 2018-12-07)
  We present our approach for generating automatic summaries from a collection of news articles about Hurricane Florence acquired from the World Wide Web. Our approach consists of 10 distinct steps, at the end of which we produce three separate summaries using three distinct methods:
  1. A template summary, in which we extract information from the web page collection to fill in blanks in a template.
  2. An extractive summary, in which we extract the most important sentences from the web pages in the collection.
  3. An abstractive summary, in which we use deep learning techniques to rephrase the contents of the web pages in the collection.
  The first six steps of our approach involve extracting important words, synsets, words constrained by part of speech, a set of discriminating features, important named entities, and important topics from the collection. This information is then used by the algorithms that generate the automatic summaries.
  To produce the template summary, we employed a modified version of the hurricane summary template provided to us by the instructor. For each blank space in the modified template, we used regular expression matching with selected keywords to filter out relevant sentences from the collection, and then a combination of regex matching and entity tagging to select the relevant information for filling in the blanks. Most values also required unit conversion so that we captured all values from the articles, not just values reported in a specific unit. Numerical analysis was then performed on these values to obtain either the mode or the mean, and for some values, such as rainfall, the standard deviation was used to estimate the maximum.
  To produce the extractive summary, we employed existing extractive summarization libraries. To synthesize information from multiple articles, we used an iterative approach, concatenating generated summaries and then summarizing the concatenated summaries.
  To produce the abstractive summary, we employed existing deep learning summarization techniques. In particular, we used a pre-trained Pointer-Generator neural network model. As with the extractive summary, we clustered the web pages in the collection by topic before running them through the neural network model, to reduce the amount of repeated information produced.
  Out of the three summaries that we generated, the template summary is the best overall due to its coherence. The abstractive and extractive summaries both provide a fair amount of information but are severely lacking in organization and readability. Additionally, they provide specific details that are irrelevant to the hurricane. All three of the summaries could be improved with further data cleaning, and the template summary could easily be extended to include more information about the event so that it would be more complete.
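The template-filling step described above combines regex matching, unit conversion, and simple statistics. A minimal sketch of that idea for a single slot (rainfall) might look like the following; the pattern, the unit table, the assumed pre-filtered `sentences` input, and the "mean + 2 * stdev" rule for estimating the maximum are illustrative assumptions, not the team's code.

    import re
    import statistics

    # Illustrative pattern and unit table: match "20 inches", "50 cm", etc.
    RAINFALL = re.compile(r"(\d+(?:\.\d+)?)\s*(inches|inch|in|centimeters|cm)\b",
                          re.IGNORECASE)
    TO_INCHES = {"inches": 1.0, "inch": 1.0, "in": 1.0,
                 "centimeters": 0.3937, "cm": 0.3937}

    def rainfall_estimate(sentences):
        """Collect rainfall mentions (normalized to inches) and summarize them."""
        values = []
        for sentence in sentences:  # sentences pre-filtered by keywords such as "rain"
            for amount, unit in RAINFALL.findall(sentence):
                values.append(float(amount) * TO_INCHES[unit.lower()])
        if not values:
            return None
        mean = statistics.mean(values)
        stdev = statistics.stdev(values) if len(values) > 1 else 0.0
        # The abstract says the standard deviation was used to estimate the maximum;
        # the exact rule is not given, so "mean + 2 * stdev" is only an assumption.
        return {"mean_inches": round(mean, 1),
                "estimated_max_inches": round(mean + 2 * stdev, 1)}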
- Big Data Text Summarization - Hurricane Harvey
  Geissinger, Jack; Long, Theo; Jung, James; Parent, Jordan; Rizzo, Robert (Virginia Tech, 2018-12-12)
  Natural language processing (NLP) has advanced in recent years. Accordingly, we present progressively more complex generated text summaries on the topic of Hurricane Harvey.
  We began with TextRank, an unsupervised extractive summarization algorithm. TextRank is computationally expensive, and the sentences it selects aren't always directly related or essential to the topic at hand. When evaluating TextRank, we found that a single interjected sentence ruined the flow of the summary. We also found that the ROUGE scores for our TextRank summary were quite low compared to a gold standard that was prepared for us. However, the TextRank summary scored well on ROUGE when compared to the lead of the Wikipedia article for Hurricane Harvey.
  To improve upon TextRank, we used template summarization with named entities. Template summarization takes less time to run than TextRank, but it is supervised by the author of the template and script, who chooses which named entities are valuable; it is therefore highly dependent on human intervention to produce reasonable and readable summaries that aren't error-prone. As expected, the template summary evaluated well against both the gold standard and the Wikipedia article lead, mainly because we were able to include the named entities we thought were pertinent to the summary.
  Beyond extractive and template summarization, we pursued abstractive summarization using pointer-generator networks, and multi-document summarization with pointer-generator networks and maximal marginal relevance. The benefit of abstractive summarization is that it is closer to how humans summarize documents. Pointer-generator networks, however, require GPUs to run properly and a large amount of training data; fortunately, we were able to use a pre-trained network. The pointer-generator network is the centerpiece of our abstractive methods and allowed us to create these summaries in the first place. NLP is at an inflection point due to deep learning, and our summaries generated with a state-of-the-art pointer-generator neural network are filled with details about Hurricane Harvey, including the damage incurred, the average amount of rainfall, and the locations it affected the most. The summary is also free of grammatical errors.
  We also used a novel Python library, written by Logan Lebanoff at the University of Central Florida, for multi-document summarization with deep learning, to summarize our Hurricane Harvey dataset of 500 articles and the Wikipedia article for Hurricane Harvey. The summary of the Wikipedia article is our final summary and has the highest ROUGE scores we could attain.
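A TextRank-style extractive summarizer of the kind described above can be sketched in a few lines: build TF-IDF sentence vectors, connect sentences by cosine similarity, and rank them with PageRank. This is a generic illustration, not the team's implementation; the `articles` and `gold_standard` variables are assumed inputs, and the ROUGE comparison assumes the rouge-score package.

    import networkx as nx
    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    nltk.download("punkt", quiet=True)  # sentence tokenizer model

    def textrank_summary(text, n_sentences=5):
        sentences = nltk.sent_tokenize(text)
        if len(sentences) <= n_sentences:
            return " ".join(sentences)
        # Sentence similarity graph: nodes are sentences, edge weights are
        # cosine similarities between their TF-IDF vectors.
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
        graph = nx.from_numpy_array(cosine_similarity(tfidf))
        ranks = nx.pagerank(graph)
        # Keep the top-ranked sentences, emitted in original document order.
        top = sorted(ranks, key=ranks.get, reverse=True)[:n_sentences]
        return " ".join(sentences[i] for i in sorted(top))

    summary = textrank_summary(" ".join(articles))  # articles: list of cleaned texts

    # ROUGE comparison against a human-written reference (gold_standard: a string).
    from rouge_score import rouge_scorer
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    print(scorer.score(gold_standard, summary))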
- Big Data: New Zealand Earthquakes Summary
  Bochel, Alexander; Edmisten, William; Lee, Jun; Chandalura, Rohit (Virginia Tech, 2018-12-14)
  The purpose of this Big Data project was to create a computer-generated text summary of a major earthquake event in New Zealand. The summary was to be created from a large web page dataset, containing 280 MB of data, supplied to our team. Our team used basic and advanced machine learning techniques to create the computer-generated summary. Research into optimal ways of creating such summaries is important because it allows us to analyze large sets of textual information and identify the most important parts. It takes a human a long time to write an accurate summary, and doing so may even be impossible given the number of documents in our dataset. Using computers to do this automatically drastically increases the rate at which important information can be extracted from a set of data.
  The process our team followed was as follows. First, we extracted the most frequently appearing words in our dataset. Second, we examined these words and tagged them with their parts of speech. Next, we found and examined the most frequent named entities. We then improved our set of important words through TF-IDF vectorization and repeated the prior steps with the improved set of words. Next, our team focused on creating an extractive summary. Once we completed this step, we used templating to create our final summary.
  Our team had many interesting findings throughout this process. We learned how to use Zeppelin notebooks effectively as a tool for prototyping code. We found an efficient way to process our large dataset on the Hadoop cluster with PySpark. We learned how to clean our dataset effectively before running our programs on it, and how to create the extractive summary using a template together with our important named entities.
  Our final result was achieved using the templating method together with abstractive summarization, and included the successful generation of an extractive summary using the templating system. This result was readable and accurate according to the dataset we were given. We also achieved decent results from the extractive summarization technique; it produced mostly readable summaries but still included some noise. Since our templated summary was very specific, it is the most coherent and contains only relevant information.
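The TF-IDF step on the Hadoop cluster with PySpark, as described above, might be sketched roughly as follows. The HDFS path, vocabulary size, and the final term-ranking loop are hypothetical and illustrative, not the team's actual pipeline.

    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, IDF

    spark = SparkSession.builder.appName("earthquake-tfidf").getOrCreate()

    # One row per cleaned web page (wholetext=True keeps each file as one document).
    docs = (spark.read.text("hdfs:///collections/nz_earthquake/*.txt", wholetext=True)
                 .withColumnRenamed("value", "text"))

    words = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+").transform(docs)
    filtered = StopWordsRemover(inputCol="words", outputCol="filtered").transform(words)

    cv_model = CountVectorizer(inputCol="filtered", outputCol="tf", vocabSize=5000).fit(filtered)
    tf = cv_model.transform(filtered)
    tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)

    # Rank vocabulary terms by their total TF-IDF weight across the collection.
    totals = np.zeros(len(cv_model.vocabulary))
    for row in tfidf.select("tfidf").toLocalIterator():
        vec = row["tfidf"]                # sparse vector for one document
        totals[vec.indices] += vec.values
    top_terms = [cv_model.vocabulary[i] for i in totals.argsort()[::-1][:25]]
    print(top_terms)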
- Summarizing Fire Events with Natural Language Processing
  Plahn, Jordan; Zamani, Michael; Lee, Hayden; Trujillo, Michael (2014-12)
  Throughout this semester, we were driven by one question: how do we best summarize a fire event using articles scraped from the internet? We took a variety of approaches to answer it, incrementally constructing a solution that summarizes our events in a satisfactory manner.
  We needed a considerable amount of data to process. This data came in the form of two separate corpora: one covering the Bastrop County, Texas wildfires of 2011 and the other the Kiss nightclub fire of 2013 in Santa Maria, Brazil. For our "small" collection, the Texas wildfires, we had approximately 16,000 text files; for our "large" collection, the nightclub fire, approximately 690,000 text files. In theory, each text file contained a single news article relating to the event. In reality, this was rarely true, so we had to perform considerable preprocessing of our corpora to ensure useful outcomes.
  The incremental steps to produce our final summary took the form of 9 units completed over the course of the semester, each building on the work of the previous unit. Owing to our lack of domain knowledge at the beginning of the semester (with either fires or natural language processing), we were given considerable guidance to produce naive, albeit useful, initial solutions. In the first few units, we summarized our collections with brute-force approaches: choosing the most frequent words as descriptors, manually generating words to describe the collection, selecting descriptive lemmas, and more. Most of these approaches are characterized by arbitrarily selecting descriptors based on frequency alone, with little consideration for the underlying linguistic significance.
  From this, we transitioned to more intelligent approaches, using finer-grained techniques to remove extraneous information. We incorporated part-of-speech (POS) tagging to determine each word's part of speech, which allowed us to select the most important nouns, for example. POS tagging, together with an ever-expanding stopword list, allowed us to remove much of the uninformative material from our results.
  To improve our collection further, we needed a way to filter out more than just stopwords. In our case, many text files were unrelated to the corpus topics and could corrupt or skew our results. We therefore built a document classifier to determine whether articles were relevant and mark them appropriately, allowing us to include only the relevant articles in our processing. Despite this, our collection still suffered from considerable noise.
  In almost all of our units we employed various "big data" techniques and tools, including MapReduce and Mahout. These tools allowed us to process extremely large collections of data efficiently. With them we could select the most relevant names, topics, and sentences, providing the framework for a summary of the entire collection. These insights led us to the final two units, in which we produced summaries based on preconstructed templates of our events. Using a mixture of every technique we had learned, we constructed paragraphs that summarized both fires in our collections.
  For the final two units of the course, we were tasked with creating a paragraph summary of both the Texas wildfire and the Brazil nightclub fire events. We began with a generic fire-event template containing a set of attributes to be filled in with the best results we could extract. We decided early on to create separate templates for the more specific event types of wildfires and building fires, since some details do not overlap between the two. To fill in our templates, we created a process of extracting, refining, and finally filling in our gathered results. To extract data from our corpora, we created a regular expression for each attribute type and stored any matches found. Then, using only the top 10 results for each attribute, we filtered results by part of speech, constructed a simple grammar to modify the template according to the selected result, and conjugated any present-tense verbs to past tense.
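The template-filling process described above (one regular expression per attribute, keep the most frequent of the matches, substitute into a fire-event template) might be sketched as follows. The patterns and template text are illustrative only, and the grammar adjustment and verb-conjugation steps are omitted.

    import re
    from collections import Counter

    # Hypothetical building-fire template with attribute slots.
    TEMPLATE = ("A fire broke out in {location} on {date}. "
                "At least {deaths} people were killed.")

    # Illustrative patterns, one per attribute slot.
    ATTRIBUTE_PATTERNS = {
        "location": re.compile(r"\bin ([A-Z][a-z]+(?: [A-Z][a-z]+)*), (?:Texas|Brazil)\b"),
        "date":     re.compile(r"\b(?:January|February|March|April|May|June|July|August|"
                               r"September|October|November|December) \d{1,2}, \d{4}\b"),
        "deaths":   re.compile(r"\b(\d+) (?:people )?(?:died|killed|dead)\b", re.IGNORECASE),
    }

    def fill_template(corpus_text):
        values = {}
        for attribute, pattern in ATTRIBUTE_PATTERNS.items():
            matches = pattern.findall(corpus_text)
            if not matches:
                values[attribute] = "UNKNOWN"
                continue
            # Flatten tuple results from grouped patterns, then keep the most
            # frequent candidate as the slot value.
            flat = [m if isinstance(m, str) else m[0] for m in matches]
            values[attribute] = Counter(flat).most_common(1)[0][0]
        return TEMPLATE.format(**values)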