A Framework for Hadoop Based Digital Libraries of Tweets

Bock, Matthew

dc.contributor.author	Bock, Matthew	en
dc.contributor.committeechair	Fox, Edward A.	en
dc.contributor.committeemember	Reddy, Chandan K.	en
dc.contributor.committeemember	Kavanaugh, Andrea L.	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2017-07-18T08:00:20Z	en
dc.date.available	2017-07-18T08:00:20Z	en
dc.date.issued	2017-07-17	en
dc.description.abstract	The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects. Researchers across varying disciplines have an interest in leveraging DLRL's collections of tweets for their own analyses. However, due to the steep learning curve involved with the required tools (Spark, Scala, HBase, etc.), simply converting the Twitter data into a workable format can be a cumbersome task in itself. This prompted the effort to build a framework that will help in developing code to analyze the Twitter data, run on arbitrary tweet collections, and enable developers to leverage projects designed with this general use in mind. The intent of this thesis work is to create an extensible framework of tools and data structures to represent Twitter data at a higher level and eliminate the need to work with raw text, so as to make the development of new analytics tools faster, easier, and more efficient. To represent this data, several data structures were designed to operate on top of the Hadoop and Spark libraries of tools. The first set of data structures is an abstract representation of a tweet at a basic level, as well as several concrete implementations which represent varying levels of detail to correspond with common sources of tweet data. The second major data structure is a collection structure designed to represent collections of tweet data structures and provide ways to filter, clean, and process the collections. All of these data structures went through an iterative design process based on the needs of the developers. The effectiveness of this effort was demonstrated in four distinct case studies. 
In the first case study, the framework was used to build a new tool that selects Twitter data from DLRL's archive of tweets, cleans those tweets, and performs sentiment analysis within the topics of a collection's topic model. The second case study applies the provided tools for the purpose of sociolinguistic studies. The third case study explores large datasets to accumulate all possible analyses on the datasets. The fourth case study builds metadata by expanding the shortened URLs contained in the tweets and storing them as metadata about the collections. The framework proved to be useful and cut development time for all four of the case studies.	en
dc.description.abstractgeneral	The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects. Researchers across varying disciplines have an interest in leveraging DLRL’s collections of tweets for their own analyses. However, due to the steep learning curve involved with the required tools, simply converting the Twitter data into a workable format can be a cumbersome task in itself. This prompted the effort to build a programming framework that will help in developing code to analyze the Twitter data, run on arbitrary tweet collections, and enable developers to leverage projects designed with this general use in mind. The intent of this thesis work is to create an extensible framework of tools and data structures to represent Twitter data at a higher level and eliminate the need to work with raw text, so as to make the development of new analytics tools faster, easier, and more efficient. The effectiveness of this effort was demonstrated in four distinct case studies. In the first case study, the framework was used to build a new tool that selects Twitter data from DLRL’s archive of tweets, cleans those tweets, and performs sentiment analysis within the topics of a collection’s topic model. The second case study applies the provided tools for the purpose of sociolinguistic studies. The third case study explores large datasets to accumulate all possible analyses on the datasets. The fourth case study builds metadata by expanding the shortened URLs contained in the tweets and storing them as metadata about the collections. The framework proved to be useful and cut development time for all four of the case studies.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:11657	en
dc.identifier.uri	http://hdl.handle.net/10919/78351	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	big data	en
dc.subject	digital libraries	en
dc.subject	data structures	en
dc.title	A Framework for Hadoop Based Digital Libraries of Tweets	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en

Files

Original bundle

Name: Bock_M_T_2017.pdf
Size: 1.1 MB
Format: Adobe Portable Document Format

Name: Bock_M_T_2017_support_2.zip
Size: 39.17 MB
Description: Supporting documents

Collections

Masters Theses