CS 5604 2020: Information Storage and Retrieval TWT - Tweet Collection Management Team

dc.contributor.author	Baadkar, Hitesh	en
dc.contributor.author	Chimote, Pranav	en
dc.contributor.author	Hicks, Megan	en
dc.contributor.author	Juneja, Ikjot	en
dc.contributor.author	Kusuma, Manisha	en
dc.contributor.author	Mehta, Ujjval	en
dc.contributor.author	Patil, Akash	en
dc.contributor.author	Sharma, Irith	en
dc.date.accessioned	2020-12-17T15:39:07Z	en
dc.date.available	2020-12-17T15:39:07Z	en
dc.date.issued	2020-12-16	en
dc.description.abstract	The Tweet Collection Management (TWT) Team aims to ingest 5 billion tweets, clean this data, analyze the metadata present, extract key information, classify tweets into categories, and finally index these tweets into Elasticsearch for browsing and querying. The main deliverable of this project is a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects. As a starting point, we focused on two development goals: (1) hashtag-based and (2) username-based search for tweets. For IR1, we completed extraction of two fields within our sample collection: hashtags and usernames. Sample code for TwiRole, a user-classification program, was investigated for use in our project. We were able to sample from multiple collections of tweets, spanning topics like COVID-19 and hurricanes. Initial work used a sample collection provided via Google Drive. NFS-based persistent storage was later added to allow access to larger collections. In total, we have developed 9 services to extract key information like username, hashtags, geo-location, and keywords from tweets. We have also developed services for parsing and cleaning raw API data and for backing up data in an Apache Parquet filestore. All services are Dockerized and added to the GitLab Container Registry. The services are deployed in the CS cloud cluster so that they integrate into the full search engine workflow. A service was created to convert WARC files to JSON so that archive files can be read into the application. Unit testing of services is complete, and end-to-end tests have been conducted to improve system robustness and avoid failure during deployment. The TWT team has indexed 3,200 tweets into the Elasticsearch index.
Future work could involve parallelization of the extraction of metadata, an alternative feature-flag approach, advanced geo-location inference, and adoption of the DMI-TCAT format. Key deliverables include a data body that allows for search, sort, filter, and visualization of raw tweet collections and metadata analysis; a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects; and a user guide to assist those using the system.	en
dc.description.notes	TWT_CS5604_F2020_Report_Overleaf.zip: the zipped Overleaf file of the TWT team's report. TWT_CS5604_F2020_Code.zip: the zipped folders containing the TWT team's services. TWT_CS5604_F2020_Presentation.pptx: the PowerPoint presentation of the TWT team. TWT_CS5604_F2020_Presentation.pdf: the PDF of the presentation of the TWT team. TWT_CS5604_F2020_Report.pdf: the PDF of the TWT team's report.	en
dc.description.sponsorship	NSF CMMI-1638207	en
dc.identifier.uri	http://hdl.handle.net/10919/101520	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.subject	tweet	en
dc.subject	search	en
dc.subject	query	en
dc.subject	TwiRole	en
dc.subject	WARC	en
dc.subject	cluster	en
dc.subject	Parquet	en
dc.subject	big data	en
dc.subject	NFS	en
dc.subject	deployment	en
dc.subject	automated	en
dc.subject	CI/CD	en
dc.subject	containers	en
dc.subject	Docker	en
dc.subject	indexing	en
dc.subject	Elasticsearch	en
dc.subject	metadata	en
dc.subject	methodology	en
dc.subject	workflows	en
dc.subject	services	en
dc.subject	information storage	en
dc.subject	information retrieval	en
dc.subject	Twitter	en
dc.subject	CS5604	en
dc.subject	Digital Library Research Laboratory	en
dc.subject	python	en
dc.title	CS 5604 2020: Information Storage and Retrieval TWT - Tweet Collection Management Team	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Other	en
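
The abstract describes services that extract the username and hashtags from each tweet before indexing. The following is a minimal sketch of that kind of extraction, not the team's actual service code. It assumes tweets arrive as JSON objects in the Twitter API v1.1 layout (`user.screen_name`, `entities.hashtags[].text`, `id_str`); the function name `extract_fields` and the sample tweet are hypothetical and exist only for illustration.

```python
import json
from typing import Any, Dict, List


def extract_fields(tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the tweet ID, username, and hashtags out of one tweet object.

    Assumes the Twitter API v1.1 JSON layout; missing fields yield
    None (ID, username) or an empty list (hashtags).
    """
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    # Lowercase hashtags so that a hashtag search is case-insensitive.
    hashtags: List[str] = [
        tag["text"].lower()
        for tag in entities.get("hashtags", [])
        if "text" in tag
    ]
    return {
        "id": tweet.get("id_str"),
        "username": user.get("screen_name"),
        "hashtags": hashtags,
    }


# Hypothetical sample tweet, made up for this example.
raw = json.dumps({
    "id_str": "1",
    "text": "Stay safe #COVID19 #Hurricane",
    "user": {"screen_name": "example_user"},
    "entities": {"hashtags": [{"text": "COVID19"}, {"text": "Hurricane"}]},
})

record = extract_fields(json.loads(raw))
print(record)
```

A record of this shape could then be passed to an indexing step; how the team's services map these fields into the Elasticsearch index is not specified in this record.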