Browsing by Author "Bock, Matthew"
- Classification Project in CS5604, Spring 2016
  Bock, Matthew; Cantrell, Michael; Shahin, Hossameldin L. (2016-05-04)
  In the grand scheme of a large Information Retrieval project, the work of our team was to perform text classification on both tweet collections and their associated webpages. To accomplish this task, we pursued three primary goals. We began by researching how best to extract information that can represent a given document. We then determined the best method to select features and construct feature vectors. Our final goal was to use the information gathered previously to build an effective way to classify each document in the tweet and webpage collections. These classifiers were built with consideration of the ontology developed for the IDEAL project. To show how well our work accomplished these goals, we also provide an evaluation of our methodologies. Last year's classification team researched various methods and tools useful for these goals and developed a system that accomplished similar goals with a promising degree of success. Our goal for this year was to improve upon their successes using new technologies such as Apache Spark. Spark provided us with the tools needed to build a well-optimized system capable of working with the provided small collections of tweets and webpages quickly and efficiently. Spark is also very scalable, and based on our results with the small collections we are confident in the performance of our system on larger collections. Also included in this submission is our final presentation of the project as presented to the CS5604 class, professor, and GRAs. The presentation provides a high-level overview of the project requirements and our approach to them, as well as details about our implementation and evaluation. The submission also includes our source code, so that future classes can expand on the work we have done this semester.
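  The abstract above outlines a Spark-based pipeline of feature extraction, feature-vector construction, and classification. Below is a minimal sketch of that kind of pipeline using Spark ML, assuming a hypothetical input file (tweets.json) with "text" and "label" columns and using TF-IDF features with logistic regression as a stand-in classifier; it is not the team's actual code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Tokenizer, HashingTF, IDF}
import org.apache.spark.ml.classification.LogisticRegression

object TweetClassifierSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TweetClassifierSketch").getOrCreate()

    // Assumed input: a "text" column with the tweet/webpage text and a numeric
    // "label" column derived from the ontology classes (hypothetical schema).
    val docs = spark.read.json("tweets.json")

    // Feature extraction and feature-vector construction (TF-IDF).
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")
    val hashingTF = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures")
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

    // Classification stage; logistic regression is only a placeholder choice.
    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, lr))
    val model = pipeline.fit(docs)
    model.transform(docs).select("text", "prediction").show(10)
  }
}
```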
- A Framework for Hadoop Based Digital Libraries of Tweets
  Bock, Matthew (Virginia Tech, 2017-07-17)
  The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. Researchers across varying disciplines have an interest in leveraging DLRL's collections of tweets for their own analyses. However, due to the steep learning curve involved with the required tools (Spark, Scala, HBase, etc.), simply converting the Twitter data into a workable format can be a cumbersome task in itself. This prompted the effort to build a framework that helps in developing code to analyze the Twitter data, runs on arbitrary tweet collections, and enables developers to leverage projects designed with this general use in mind. The intent of this thesis work is to create an extensible framework of tools and data structures that represents Twitter data at a higher level and eliminates the need to work with raw text, so as to make the development of new analytics tools faster, easier, and more efficient. To represent this data, several data structures were designed to operate on top of the Hadoop and Spark libraries of tools. The first set of data structures is an abstract representation of a tweet at a basic level, along with several concrete implementations that represent varying levels of detail to correspond with common sources of tweet data. The second major data structure is a collection structure designed to represent collections of tweet data structures and provide ways to filter, clean, and process the collections. All of these data structures went through an iterative design process based on the needs of the developers. The effectiveness of this effort was demonstrated in four distinct case studies. In the first case study, the framework was used to build a new tool that selects Twitter data from DLRL's archive of tweets, cleans those tweets, and performs sentiment analysis within the topics of a collection's topic model. The second case study applies the provided tools for the purpose of sociolinguistic studies. The third case study explores large datasets to accumulate all possible analyses on the datasets. The fourth case study builds metadata by expanding the shortened URLs contained in the tweets and storing them as metadata about the collections. The framework proved to be useful and cut development time for all four of the case studies.
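  This abstract describes a two-layer design: tweet data structures plus a collection structure with filter, clean, and process operations. Below is a minimal sketch of such a design on Spark Datasets; all names here (Tweet, TweetCollection, removeRetweets, cleanText) and the input schema are hypothetical stand-ins, not the framework's actual API.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Basic tweet representation (hypothetical fields; the thesis describes several
// concrete implementations at varying levels of detail).
final case class Tweet(id: Long, text: String, user: String, createdAt: String)

// Collection layer: wraps a Dataset of tweets and exposes filter/clean steps.
final class TweetCollection(val tweets: Dataset[Tweet]) {
  import tweets.sparkSession.implicits._

  def removeRetweets(): TweetCollection =
    new TweetCollection(tweets.filter(t => !t.text.startsWith("RT ")))

  def cleanText(): TweetCollection =
    new TweetCollection(tweets.map(t =>
      t.copy(text = t.text.replaceAll("https?://\\S+", "").trim)))
}

object TweetFrameworkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TweetFrameworkSketch").getOrCreate()
    import spark.implicits._

    // Assumed input: JSON records whose fields match the case class above.
    val raw = spark.read.json("tweets.json").as[Tweet]
    val cleaned = new TweetCollection(raw).removeRetweets().cleanText()
    cleaned.tweets.show(10)
  }
}
```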
- Sentiment and Topic Analysis
  Bartolome, Abigail; Bock, Matthew; Vinayagam, Radha Krishnan; Krishnamurthy, Rahul (Virginia Tech, 2017-05-03)
  The IDEAL (Integrated Digital Event Archiving and Library) and Global Event and Trend Archive Research (GETAR) projects have collected over 1.5 billion tweets and webpages from social media and the World Wide Web, and have indexed them to be easily retrieved and analyzed. This gives researchers an extensive library of documents that reflect the interests and sentiments of the public in reaction to an event. By applying topic analysis to collections of tweets, researchers can learn the topics of most interest or concern to the general public. Adding a layer of sentiment analysis to those topics illustrates how the public felt about the topics that were found. The Sentiment and Topic Analysis team has designed a system that joins topic analysis and sentiment analysis for researchers who are interested in learning more about public reaction to global events. The tool runs topic analysis on a collection of tweets, and the user can select a topic of interest and assess the sentiments with regard to that topic (i.e., positive vs. negative). This submission covers the background, requirements, design, and implementation of our contributions to this project. Furthermore, we include data, scripts, source code, a user manual, and a developer manual to assist in any future work.
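  The flow this abstract describes, topic analysis over a tweet collection followed by sentiment assessment within a chosen topic, can be sketched roughly as below. The column names, the ten-topic LDA setting, and the tiny word-list sentiment scorer are all hypothetical illustrations, not the team's implementation (which is documented in the submission itself).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer}
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.linalg.Vector

object TopicSentimentSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TopicSentimentSketch").getOrCreate()

    // Assumed input: cleaned tweets with a "text" column (hypothetical schema).
    val tweets = spark.read.json("tweets.json")

    val tokens = new Tokenizer().setInputCol("text").setOutputCol("tokens").transform(tweets)
    val cvModel = new CountVectorizer().setInputCol("tokens").setOutputCol("features").fit(tokens)
    val vectorized = cvModel.transform(tokens)

    // Topic analysis: fit an LDA topic model and tag each tweet with its dominant topic.
    val ldaModel = new LDA().setK(10).setFeaturesCol("features").fit(vectorized)
    val dominantTopic = udf((dist: Vector) => dist.argmax)

    // Toy lexicon-based sentiment score (placeholder for a real sentiment analyzer).
    val positive = Set("good", "great", "happy", "love")
    val negative = Set("bad", "sad", "terrible", "hate")
    val sentiment = udf((words: Seq[String]) =>
      words.count(w => positive.contains(w)) - words.count(w => negative.contains(w)))

    // Aggregate sentiment per topic so a user can inspect the topic they care about.
    ldaModel.transform(vectorized)
      .withColumn("topic", dominantTopic(col("topicDistribution")))
      .withColumn("sentiment", sentiment(col("tokens")))
      .groupBy("topic")
      .agg(avg("sentiment").as("meanSentiment"), count("*").as("tweets"))
      .orderBy("topic")
      .show()
  }
}
```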