CS5604: Information and Storage Retrieval Fall 2016 - CMT (Collection Management Tweets)

dc.contributor.author: Wagner, Mitchell J. [en]
dc.contributor.author: Abidi, Faiz [en]
dc.contributor.author: Fan, Shuangfei [en]
dc.date.accessioned: 2016-12-20T01:35:23Z [en]
dc.date.available: 2016-12-20T01:35:23Z [en]
dc.date.issued: 2016-12-08 [en]
dc.description.abstract: As the Collection Management Tweets team in the Fall 2016 CS5604 class, we were responsible for processing >1.2 billion tweets, including data transfer, noise reduction, tweet augmentation, and storage via several technologies. Our work was the first step in a pipeline that included many teams and ultimately culminated in a comprehensive information retrieval system. We were also responsible for building a social network (or set of networks) for those tweets, along with their tweeters. In this report, we detail our experience with this project. Additionally, we propose solutions for transferring incremental database updates from MySQL to HDFS and subsequently to HBase, derive a graph structure and relationships from entities identified in tweet collections, and offer a query-independent method for estimating the importance of those entities. We achieve these goals through the use of several open-source software packages, and present open, scalable solutions addressing the objectives we were given. [en]
dc.description.notes: This submission encompasses our work over the course of the semester, and includes PDF and PowerPoint copies of our final presentation, PDF and LaTeX copies of our final report, and a copy of the code we developed. [en]
dc.description.sponsorship: NSF: IIS-1619028 [en]
dc.description.sponsorship: NSF: IIS-1319578 [en]
dc.identifier.uri: http://hdl.handle.net/10919/73739 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Creative Commons CC0 1.0 Universal Public Domain Dedication [en]
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/ [en]
dc.subject: Information [en]
dc.subject: Storage [en]
dc.subject: Retrieval [en]
dc.subject: ETL [en]
dc.subject: HBase [en]
dc.subject: HDFS [en]
dc.subject: Pig [en]
dc.subject: social network [en]
dc.subject: tweet ranking [en]
dc.subject: tweet [en]
dc.subject: Twitter [en]
dc.subject: task-independent recommendation [en]
dc.subject: csv2avro [en]
dc.subject: pt-archiver [en]
dc.subject: MySQL [en]
dc.subject: database [en]
dc.title: CS5604: Information and Storage Retrieval Fall 2016 - CMT (Collection Management Tweets) [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.type: Software [en]

Files

Original bundle
Name: CMT_Code.zip
Size: 12.66 MB

Name: CMT_Final_Presentation.pdf
Size: 776.53 KB
Format: Adobe Portable Document Format

Name: CMT_Final_Presentation.pptx
Size: 1.78 MB
Format: Microsoft Powerpoint XML

Name: CMT_Final_Report_LaTeX.zip
Size: 11.56 MB

Name: CMT_Final_Report.pdf
Size: 9.84 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission