Collection Management Tweets Project Fall 2017

Khaghani, Farnaz; Zeng, Junkai; Bhuiyan, Momen; Tabassum, Anika; Bandyopadhyay, Payel

dc.contributor.author	Khaghani, Farnaz	en
dc.contributor.author	Zeng, Junkai	en
dc.contributor.author	Bhuiyan, Momen	en
dc.contributor.author	Tabassum, Anika	en
dc.contributor.author	Bandyopadhyay, Payel	en
dc.date.accessioned	2018-02-02T16:25:24Z	en
dc.date.available	2018-02-02T16:25:24Z	en
dc.date.issued	2018-01-17	en
dc.description.abstract	The report included in this submission documents the work of the Collection Management Tweets (CMT) team, part of the larger CS5604 effort to build a state-of-the-art information retrieval and analysis system for the IDEAL (Integrated Digital Event Archiving and Library) and GETAR (Global Event and Trend Archive Research) projects. The mission of the CMT team had two parts: 1) cleaning 6.2 million tweets from two 2017 event collections named "Solar Eclipse" and "Las Vegas Shooting", and loading them into HBase, an open source, non-relational, distributed database that runs on the Hadoop distributed file system, in support of further use; and 2) building and storing a social network for the tweet data using a triple-store. For the first part, our work included: A) building on the work of the previous year's class group, which implemented incremental update, to speed up the development of data collection and storage; B) improving the performance of last year's work. Previously, the cleaning steps, e.g., removing profanity and extracting hashtags and mentions, were written in Python, which became very slow as the dataset scaled up. We parallelized our tweet cleaning process with the help of Scala and the Hadoop cluster, and used several Natural Language Processing libraries for stop word and profanity removal; C) identifying and storing, along with the cleaned tweets, Named-Entity-Recognition (NER) entries and Part-of-speech (POS) tags, which the previous team had not done. The cleaned data in HBase from this task is provided to the Classification team for spam detection and to the Clustering and Topic Analysis team for topic analysis. The Collection Management Webpage team uses the URLs extracted from the tweets for further processing.
Finally, after the data is indexed by the SOLR team, the Front-End team visualizes the tweets for users and provides access for searching and browsing. In addition to the aforementioned tasks, our responsibilities included building a network of tweets. This entailed researching which types of database are appropriate for this graph. To store the network, we used a triple-store database that records the different types of edges and relationships in the graph. We also researched methods for ascribing importance to nodes and edges in our social networks once they were constructed, and analyzed our networks using these techniques.	en
dc.description.notes	This submission includes files: CMT Final report.pdf - a PDF version of the final report; CMT final report.zip - a zip archive with contents from Overleaf of the LaTeX files used to build the final report; Collection Management Tweet.pptx - a PowerPoint file version of the final project presentation; Collection Management Tweet.pdf - a PDF version of the final project presentation; cs5604f17_cmt.zip - a zip archive of the code developed for cleaning and formatting the tweet data; Social network.zip - code and data related to constructing the triple-store by analyzing the tweet data	en
dc.description.sponsorship	Collaborative Research: Global Event and Trend Archive Research (GETAR) project, supported by the National Science Foundation under Grant No. IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/81996	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	tweet collections	en
dc.subject	Named Entity Recognition	en
dc.subject	triple-store	en
dc.subject	Hadoop cluster	en
dc.subject	Scala	en
dc.subject	part-of-speech (POS) tagging	en
dc.subject	social network analysis	en
dc.title	Collection Management Tweets Project Fall 2017	en
dc.type	Dataset	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en
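
The abstract describes extracting hashtags, mentions, and URLs from tweets and recording relationships in a triple-store. The following is only a minimal illustrative sketch of that idea, not the team's code (the abstract states their pipeline used Scala on the Hadoop cluster). The regular expressions, the function names extract_entities and to_triples, and the predicate names "mentions" and "uses_hashtag" are all assumptions made for illustration.

```python
import re

# Hypothetical, simplified patterns; real tweet tokenization has more edge cases.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")


def extract_entities(text):
    """Return the hashtags, mentions, and URLs found in one tweet's text."""
    urls = URL.findall(text)
    # Remove URLs first so a fragment such as "#now" inside a link
    # is not counted as a hashtag.
    stripped = URL.sub(" ", text)
    return {
        "hashtags": HASHTAG.findall(stripped),
        "mentions": MENTION.findall(stripped),
        "urls": urls,
    }


def to_triples(author, text):
    """Turn one tweet into (subject, predicate, object) edges.

    The predicate names are assumptions, not the project's schema.
    """
    entities = extract_entities(text)
    triples = [(author, "mentions", m) for m in entities["mentions"]]
    triples += [(author, "uses_hashtag", h) for h in entities["hashtags"]]
    return triples
```

Each returned tuple corresponds to one edge that could be loaded into a triple-store; the extracted URLs are kept separately, as the abstract notes they are handed to another team.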
