Browsing by Author "Yang, Limin"
Now showing 1 - 2 of 2
- Big Data Text Summarization - Attack Westminster
  Gallagher, Colm; Dyer, Jamie; Liebold, Jeanine; Becker, Aaron; Yang, Limin (Virginia Tech, 2018-12-14)
  Automatic text summarization is the process of using software to distill the most important information in a text document into an abridged summary. In essence, summarization can be regarded as a function that takes one or more documents as input and produces a summary as output. There are two broad ways to create a summary: extractive and abstractive. Extractive summarization selects the most relevant sentences from the input and concatenates them to form a summary; graph-based algorithms like TextRank, feature-based models like TextTeaser, topic-based models like Latent Semantic Analysis (LSA), and grammar-based models are all extractive approaches. Abstractive summarization aims to create a summary the way a human would: it preserves the original intent but uses new phrases and words not found in the original text. One of the most commonly used models is the encoder-decoder model, a neural network architecture mainly used in machine translation. Recently, hybrid approaches that combine extractive and abstractive summarization have emerged, such as the Pointer-Generator Network and the Extract-then-Abstract model. In this course, we were given both a small dataset (about 500 documents) and a big dataset (about 11,300 documents), consisting mainly of web archives about a specific event. Our group focused on reports about a terrorist event -- Attack Westminster. It occurred outside the Palace of Westminster in London on March 22, 2017. The attacker, 52-year-old Briton Khalid Masood, drove a car into pedestrians on the pavement, injuring more than 50 people, 5 of them fatally. The attack was treated as "Islamist-related terrorism".
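The extractive approach the abstract names can be illustrated with a toy TextRank-style ranker: build a sentence-similarity graph and score sentences by power-iterated PageRank. This is a minimal sketch with made-up example sentences, not the report's actual code; the similarity function and constants are illustrative assumptions.

```python
# Toy sketch of TextRank-style extractive summarization (illustrative,
# not the report's implementation).
import math
import re

def similarity(a, b):
    """Word-overlap similarity between two sentences, length-normalized."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / (math.log(len(wa) + 1) + math.log(len(wb) + 1))

def textrank(sentences, d=0.85, iters=50):
    """Return sentence indices ranked by PageRank over the similarity graph."""
    n = len(sentences)
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):  # power iteration
        scores = [(1 - d) / n + d * sum(w[j][i] / (sum(w[j]) or 1.0) * scores[j]
                                        for j in range(n))
                  for i in range(n)]
    return sorted(range(n), key=lambda i: -scores[i])

sents = [
    "The attack occurred outside the Palace of Westminster in London.",
    "Khalid Masood drove a car into pedestrians on the pavement.",
    "More than 50 people were injured in the attack.",
    "Police treated the incident as Islamist-related terrorism.",
]
top = textrank(sents)[:2]
summary = " ".join(sents[i] for i in sorted(top))  # keep original order
print(summary)
```

Selecting the top-ranked sentences and emitting them in document order is what makes the result extractive: every sentence in the summary appears verbatim in the input.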
We first created a Solr index for both the small and the big dataset, which let us run various queries to learn more about the data; the index also aided another team in creating a gold standard summary of our dataset for us. We then gradually delved into different concepts in text summarization and natural language processing. Specifically, we used the NLTK library and the spaCy package to build a set of the most frequent important words, WordNet synsets covering those words, words constrained by part of speech (POS), and frequent and important named entities. We also applied the LSA model to retrieve the most important topics. By clustering the dataset with k-means and selecting important sentences from the clusters with an implementation of the TextRank algorithm, we generated a multi-paragraph summary. With the help of named entity recognition and pattern-based matching, we confidently extracted information such as the attacker's name, the date, the location, nearby landmarks, the number killed, the number injured, and the type of the attack; we then drafted a template of a readable summary to fill in the slots and values. Each of these results individually formed a summary capturing the most important information about the Westminster Attack. The most successful results were obtained using the extractive summarization method (k-means clustering and TextRank), the slot-value method (named entity recognition and pattern-based matching), and the abstractive summarization method (deep learning). We evaluated each summary with a combination of ROUGE metrics and named-entity coverage against the gold standard summary created by Team 3. Overall, the best summary was obtained using the extractive summarization method, which outperformed the others on both ROUGE metrics and named-entity coverage.
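The slot-value step described above can be sketched with simple regex pattern matching: extract values for named slots, then fill a summary template. The patterns, sample text, and template here are illustrative assumptions, not the report's actual rules (the report also used named entity recognition via spaCy).

```python
# Hedged sketch of the slot-value approach: regex patterns fill slots in a
# summary template (patterns are illustrative, not the report's rules).
import re

TEXT = ("The attacker, 52-year-old Briton Khalid Masood, drove a car into "
        "pedestrians outside the Palace of Westminster on March 22, 2017, "
        "injuring more than 50 people.")

PATTERNS = {
    "attacker": r"Briton ([A-Z][a-z]+ [A-Z][a-z]+)",          # name after nationality
    "date":     r"on ([A-Z][a-z]+ \d{1,2}, \d{4})",           # Month D, YYYY
    "injured":  r"injuring (?:more than )?(\d+)",             # casualty count
}

def fill_slots(text, patterns):
    """Return a dict of slot name -> first regex match in the text."""
    slots = {}
    for name, pat in patterns.items():
        m = re.search(pat, text)
        if m:
            slots[name] = m.group(1)
    return slots

slots = fill_slots(TEXT, PATTERNS)

TEMPLATE = ("On {date}, {attacker} carried out an attack, "
            "injuring more than {injured} people.")
print(TEMPLATE.format(**slots))
```

The template guarantees a readable sentence regardless of which documents the values came from, which is the appeal of the slot-value method over raw extraction.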
- How Similar Are Forest Disturbance Maps Derived from Different Landsat Time Series Algorithms?
  Cohen, Warren B.; Healey, Sean P.; Yang, Zhiqiang; Stehman, Stephen V.; Brewer, C. Kenneth; Brooks, Evan B.; Gorelick, Noel; Huang, Chengquan; Hughes, M. Joseph; Kennedy, Robert E.; Loveland, Thomas R.; Moisen, Gretchen G.; Schroeder, Todd A.; Vogelmann, James E.; Woodcock, Curtis E.; Yang, Limin; Zhu, Zhe (MDPI, 2017-03-26)
  Disturbance is a critical ecological process in forested systems, and disturbance maps are important for understanding forest dynamics. Landsat data are a key remote sensing dataset for monitoring forest disturbance, and there has recently been major growth in the development of disturbance mapping algorithms. Many of these algorithms take advantage of the high temporal data volume to mine subtle signals in Landsat time series, but as those signals become subtler, they are more likely to be mixed with noise in Landsat data. This study examines the similarity among seven different algorithms in their ability to map the full range of magnitudes of forest disturbance over six different Landsat scenes distributed across the conterminous US. The maps agreed very well on the amount of undisturbed forest over time; however, for the ~30% of forest mapped as disturbed in a given year by at least one algorithm, there was little agreement about which pixels were affected. Algorithms that targeted higher-magnitude disturbances exhibited higher omission errors but lower commission errors than those targeting a broader range of disturbance magnitudes. These results suggest that a user of any given forest disturbance map should understand the map's strengths and weaknesses (in terms of omission and commission error rates) with respect to the disturbance targets of interest.
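The omission/commission trade-off the abstract describes comes straight from a map-vs-reference confusion matrix. A minimal sketch with made-up pixel counts (not the paper's data) shows how the two error rates are computed and why an algorithm that only flags high-magnitude disturbance tends to score high omission but low commission:

```python
# Illustrative sketch: omission and commission error rates from a
# confusion matrix (counts are invented, not the paper's data).
def error_rates(true_pos, false_neg, false_pos):
    """Omission error: disturbed reference pixels the map missed.
    Commission error: map pixels wrongly labeled disturbed."""
    omission = false_neg / (true_pos + false_neg)    # 1 - producer's accuracy
    commission = false_pos / (true_pos + false_pos)  # 1 - user's accuracy
    return omission, commission

# A high-magnitude-only algorithm: misses subtle disturbance (high omission)
# but rarely flags stable forest (low commission).
om, com = error_rates(true_pos=60, false_neg=40, false_pos=5)
print(f"omission={om:.2f}, commission={com:.2f}")  # omission=0.40, commission=0.08
```

Broadening the target to subtle disturbance moves counts from `false_neg` to `true_pos` but also inflates `false_pos`, lowering omission at the cost of commission — the trade-off the study observed across its seven algorithms.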