Collection Management Tweets Project Fall 2017

dc.contributor.author | Khaghani, Farnaz | en
dc.contributor.author | Zeng, Junkai | en
dc.contributor.author | Bhuiyan, Momen | en
dc.contributor.author | Tabassum, Anika | en
dc.contributor.author | Bandyopadhyay, Payel | en
dc.date.accessioned | 2018-02-02T16:25:24Z | en
dc.date.available | 2018-02-02T16:25:24Z | en
dc.date.issued | 2018-01-17 | en
dc.description.abstract | The report included in this submission documents the work of the Collection Management Tweets (CMT) team, part of the larger effort in CS5604 to build a state-of-the-art information retrieval and analysis system for the IDEAL (Integrated Digital Event Archiving and Library) and GETAR (Global Event and Trend Archive Research) projects. The mission of the CMT team had two parts: 1) cleaning 6.2 million tweets from two 2017 event collections named "Solar Eclipse" and "Las Vegas Shooting", and loading them into HBase, an open source, non-relational, distributed database that runs on the Hadoop distributed file system, in support of further use; and 2) building and storing a social network for the tweet data using a triple-store. For the first part, our work included: A) making use of the work done by the previous year's class group, which implemented incremental updates, to speed up development of data collection and storage; B) improving the performance of the previous year's work. Previously, the cleaning step (e.g., removing profanity and extracting hashtags and mentions) used Python, which became very slow as the dataset scaled up. We introduced parallelization into our tweet cleaning process with the help of Scala and the Hadoop cluster, and used several natural language processing libraries for stop word and profanity removal; C) along with tweet cleaning, identifying and storing named-entity recognition (NER) entries and part-of-speech (POS) tags with the tweets, which the previous team had not done. The cleaned data in HBase from this task is provided to the Classification team for spam detection and to the Clustering and Topic Analysis team for topic analysis. The Collection Management Webpage team uses the URLs extracted from the tweets for further processing. Finally, after the data is indexed by the SOLR team, the Front-End team visualizes the tweets for users and provides access for searching and browsing. In addition to the aforementioned tasks, our responsibilities included building a network of tweets. This entailed researching which types of database are appropriate for this graph. To store the network, we used a triple-store database to record the different types of edges and relationships in the graph. We also researched methods for ascribing importance to nodes and edges in our social networks once they were constructed, and analyzed our networks using these techniques. | en
dc.description.notes | This submission includes files: CMT Final report.pdf - a PDF version of the final report; CMT final report.zip - a zip archive with contents from Overleaf of the LaTeX files used to build the final report; Collection Management Tweet.pptx - a PowerPoint file version of the final project presentation; Collection Management Tweet.pdf - a PDF version of the final project presentation; cs5604f17_cmt.zip - a zip archive of the code developed for cleaning and formatting the tweet data; Social network.zip - code and data related to constructing the triple-store by analyzing the tweet data | en
dc.description.sponsorship | Collaborative Research: Global Event and Trend Archive Research (GETAR) project, supported by the National Science Foundation under Grant No. IIS-1619028 | en
dc.identifier.uri | http://hdl.handle.net/10919/81996 | en
dc.language.iso | en_US | en
dc.publisher | Virginia Tech | en
dc.rights | In Copyright | en
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en
dc.subject | tweet collections | en
dc.subject | Named Entity Recognition | en
dc.subject | triple-store | en
dc.subject | Hadoop cluster | en
dc.subject | Scala | en
dc.subject | part-of-speech (POS) tagging | en
dc.subject | social network analysis | en
dc.title | Collection Management Tweets Project Fall 2017 | en
dc.type | Dataset | en
dc.type | Presentation | en
dc.type | Report | en
dc.type | Software | en
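
For readers unfamiliar with the cleaning step summarized in the abstract, the following is a minimal sketch of per-tweet cleaning: it extracts hashtags, mentions, and URLs, drops profanity, and emits one (subject, predicate, object) triple per mention, as a triple-store might record it. It is not the team's code (the abstract states that their pipeline was written in Scala and ran on a Hadoop cluster). The function name `clean_tweet`, the `PROFANITY` word list, the `mentions` predicate, and the sample tweet are all assumptions made for illustration.

```python
import re

# Hypothetical placeholder word list; the libraries and word lists the
# team actually used are not specified in this record.
PROFANITY = {"darn", "heck"}

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://\S+")


def clean_tweet(author, text):
    """Extract hashtags, mentions, and URLs, then return the cleaned text."""
    urls = URL_RE.findall(text)
    without_urls = URL_RE.sub(" ", text)
    hashtags = HASHTAG_RE.findall(without_urls)
    mentions = MENTION_RE.findall(without_urls)
    # Drop the '#' and '@' markers but keep the words themselves.
    stripped = re.sub(r"[#@]", "", without_urls)
    # Remove any word that matches the profanity list, ignoring case and
    # trailing punctuation.
    words = [
        w for w in stripped.split()
        if w.lower().strip(".,!?") not in PROFANITY
    ]
    # One (subject, predicate, object) triple per mention edge; the
    # "mentions" predicate name is an assumption.
    triples = [(author, "mentions", m) for m in mentions]
    return {
        "clean_text": " ".join(words),
        "hashtags": hashtags,
        "mentions": mentions,
        "urls": urls,
        "triples": triples,
    }


result = clean_tweet(
    "alice",
    "Darn clouds! @bob did you see the #SolarEclipse? https://example.org/pic",
)
print(result)
```

Because `clean_tweet` handles each tweet independently, a function of this shape could in principle be mapped over a collection in parallel, which is the kind of scaling the abstract attributes to the Scala and Hadoop rewrite.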

Files

Original bundle (showing 5 of 6 files)
Name: CMT Final report.pdf | Size: 2.63 MB | Format: Adobe Portable Document Format
Name: cs5604f17_cmt.zip | Size: 1.82 MB | Format: (not listed)
Name: Social network.zip | Size: 77.02 MB | Format: (not listed)
Name: CMT final report.zip | Size: 3.57 MB | Format: (not listed)
Name: Collection Management Tweet.pdf | Size: 1.2 MB | Format: Adobe Portable Document Format
License bundle (showing 1 of 1 files)
Name: license.txt | Size: 1.5 KB | Format: Item-specific license agreed upon to submission