CS5604: Information and Storage Retrieval Fall 2016 - CMT (Collection Management Tweets)

Date
2016-12-08
Publisher
Virginia Tech
Abstract

As the Collection Management Tweets (CMT) team in the Fall 2016 CS5604 class, we were responsible for processing more than 1.2 billion tweets, including data transfer, noise reduction, tweet augmentation, and storage using several technologies. Our work was the first stage of a multi-team pipeline that culminated in a comprehensive information retrieval system. We were also responsible for building a social network (or set of networks) for those tweets and their tweeters. In this report, we detail our experience with this project. Additionally, we propose solutions for transferring incremental database updates from MySQL to HDFS and subsequently to HBase, derive a graph structure and relationships from entities identified in tweet collections, and offer a query-independent method for estimating the importance of those entities. We achieve these goals using several open-source software packages, and we present open, scalable solutions that address the objectives we were given.
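
To make the last two ideas in the abstract concrete, the following is a minimal sketch of deriving a graph from entities found in tweets and scoring those entities without reference to any query. It is not the method described in the report: it assumes that @-mentions are the only entities extracted and that a PageRank-style power iteration serves as the importance measure. The function names `build_mention_graph` and `pagerank`, the sample tweets, and the damping value are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict

# Assumption: user mentions of the form @name are the entities of interest.
MENTION_RE = re.compile(r"@(\w+)")


def build_mention_graph(tweets):
    """Build a directed graph: author -> set of users mentioned by that author.

    `tweets` is an iterable of (author, text) pairs. Every user that appears,
    as author or as mention, becomes a node, even with no outgoing edges.
    """
    graph = defaultdict(set)
    for author, text in tweets:
        author = author.lower()
        graph[author]  # touch the key so the author exists as a node
        for mentioned in MENTION_RE.findall(text):
            mentioned = mentioned.lower()
            if mentioned != author:  # ignore self-mentions
                graph[author].add(mentioned)
                graph[mentioned]  # make sure the target exists as a node
    return dict(graph)


def pagerank(graph, damping=0.85, iterations=50):
    """Query-independent importance via PageRank power iteration.

    Nodes without outgoing edges spread their rank uniformly over all nodes,
    so the scores always sum to 1.
    """
    nodes = sorted(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[u] for u in nodes if not graph[u])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for u in nodes:
            if graph[u]:
                share = damping * rank[u] / len(graph[u])
                for v in graph[u]:
                    new_rank[v] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical sample data, not drawn from the report's collections.
    sample = [
        ("alice", "thanks @bob and @carol"),
        ("bob", "agree with @carol"),
        ("dave", "hello @carol"),
    ]
    scores = pagerank(build_mention_graph(sample))
    for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{user}: {score:.4f}")
```

Because the scores depend only on the structure of the graph, they can be computed once per collection and stored alongside each entity, rather than being recomputed for every search.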

Keywords
Information, Storage, Retrieval, ETL, HBase, HDFS, Pig, social network, tweet ranking, tweet, Twitter, task-independent recommendation, csv2avro, pt-archiver, MySQL, database