CS5604: Information and Storage Retrieval Fall 2016 - CMT (Collection Management Tweets)
dc.contributor.author | Wagner, Mitchell J. | en |
dc.contributor.author | Abidi, Faiz | en |
dc.contributor.author | Fan, Shuangfei | en |
dc.date.accessioned | 2016-12-20T01:35:23Z | en |
dc.date.available | 2016-12-20T01:35:23Z | en |
dc.date.issued | 2016-12-08 | en |
dc.description.abstract | As the Collection Management Tweets team in the Fall 2016 CS5604 class, we were responsible for processing >1.2 billion tweets, including data transfer, noise reduction, tweet augmentation, and storage via several technologies. Our work was the first step in a pipeline that included many teams and ultimately culminated in a comprehensive information retrieval system. We were also responsible for building a social network (or set of networks) for those tweets, along with their tweeters. In this report, we detail our experience with this project. Additionally, we propose solutions for transferring incremental database updates from MySQL to HDFS and subsequently to HBase, derive a graph structure and relationships from entities identified in tweet collections, and offer a query-independent method for estimating the importance of those entities. We achieve these goals through the use of several open-source software packages, and present open, scalable solutions addressing the objectives we were given. | en |
dc.description.notes | This submission encompasses our work over the course of the semester, and includes PDF and PowerPoint copies of our final presentation, PDF and LaTeX copies of our final report, and a copy of the code we developed. | en |
dc.description.sponsorship | NSF: IIS-1619028 | en |
dc.description.sponsorship | NSF: IIS-1319578 | en |
dc.identifier.uri | http://hdl.handle.net/10919/73739 | en |
dc.language.iso | en_US | en |
dc.publisher | Virginia Tech | en |
dc.rights | Creative Commons CC0 1.0 Universal Public Domain Dedication | en |
dc.rights.uri | http://creativecommons.org/publicdomain/zero/1.0/ | en |
dc.subject | Information | en |
dc.subject | Storage | en |
dc.subject | Retrieval | en |
dc.subject | ETL | en |
dc.subject | HBase | en |
dc.subject | HDFS | en |
dc.subject | Pig | en |
dc.subject | social network | en |
dc.subject | tweet ranking | en |
dc.subject | tweet | en |
dc.subject | task-independent recommendation | en |
dc.subject | csv2avro | en |
dc.subject | pt-archiver | en |
dc.subject | MySQL | en |
dc.subject | database | en |
dc.title | CS5604: Information and Storage Retrieval Fall 2016 - CMT (Collection Management Tweets) | en |
dc.type | Presentation | en |
dc.type | Report | en |
dc.type | Software | en |
Files
Original bundle
1 - 5 of 5
- Name: CMT_Final_Presentation.pdf
- Size: 776.53 KB
- Format: Adobe Portable Document Format
License bundle
1 - 1 of 1
- Name: license.txt
- Size: 1.5 KB
- Format: Item-specific license agreed upon to submission