CS 5604 2020: Information Storage and Retrieval TWT - Tweet Collection Management Team

dc.contributor.author: Baadkar, Hitesh (en)
dc.contributor.author: Chimote, Pranav (en)
dc.contributor.author: Hicks, Megan (en)
dc.contributor.author: Juneja, Ikjot (en)
dc.contributor.author: Kusuma, Manisha (en)
dc.contributor.author: Mehta, Ujjval (en)
dc.contributor.author: Patil, Akash (en)
dc.contributor.author: Sharma, Irith (en)
dc.date.accessioned: 2020-12-17T15:39:07Z (en)
dc.date.available: 2020-12-17T15:39:07Z (en)
dc.date.issued: 2020-12-16 (en)
dc.description.abstract: The Tweet Collection Management (TWT) Team aims to ingest 5 billion tweets, clean the data, analyze the metadata, extract key information, classify tweets into categories, and finally index the tweets into Elasticsearch for browsing and querying. The main deliverable of this project is a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects. As a starting point, we focused on two development goals: (1) hashtag-based and (2) username-based search for tweets. For IR1, we completed extraction of two fields within our sample collection: hashtags and username. We investigated sample code for TwiRole, a user-classification program, for use in our project. We sampled from multiple collections of tweets, spanning topics such as COVID-19 and hurricanes. Initial work used a sample collection provided via Google Drive; NFS-based persistent storage was added later to allow access to larger collections. In total, we have developed 9 services to extract key information such as username, hashtags, geo-location, and keywords from tweets. We have also developed services for parsing and cleaning raw API data and for backing up data in an Apache Parquet filestore. All services are Dockerized and added to the GitLab Container Registry. The services are deployed in the CS cloud cluster so that they integrate into the full search engine workflow. A further service converts WARC files to JSON so that archive files can be read into the application. Unit testing of the services is complete, and end-to-end tests have been conducted to improve system robustness and avoid failures during deployment. The TWT team has indexed 3,200 tweets into the Elasticsearch index.
Future work could involve parallelizing metadata extraction, an alternative feature-flag approach, advanced geo-location inference, and adoption of the DMI-TCAT format. Key deliverables include a data body that allows for search, sort, filter, and visualization of raw tweet collections and metadata analysis; a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects; and a user guide to assist those using the system. (en)
dc.description.notes: TWT_CS5604_F2020_Report_Overleaf.zip: the zipped Overleaf file of the TWT team's report. TWT_CS5604_F2020_Code.zip: the zipped folders containing the TWT team's services. TWT_CS5604_F2020_Presentation.pptx: the Powerpoint presentation of the TWT team. TWT_CS5604_F2020_Presentation.pdf: the PDF of the presentation of the TWT team. TWT_CS5604_F2020_Report.pdf: the PDF of the TWT team's report. (en)
dc.description.sponsorship: NSF CMMI-1638207 (en)
dc.identifier.uri: http://hdl.handle.net/10919/101520 (en)
dc.language.iso: en_US (en)
dc.publisher: Virginia Tech (en)
dc.subject: tweet (en)
dc.subject: search (en)
dc.subject: query (en)
dc.subject: TwiRole (en)
dc.subject: WARC (en)
dc.subject: cluster (en)
dc.subject: Parquet (en)
dc.subject: big data (en)
dc.subject: NFS (en)
dc.subject: deployment (en)
dc.subject: automated (en)
dc.subject: CI/CD (en)
dc.subject: containers (en)
dc.subject: Docker (en)
dc.subject: indexing (en)
dc.subject: Elasticsearch (en)
dc.subject: metadata (en)
dc.subject: methodology (en)
dc.subject: workflows (en)
dc.subject: services (en)
dc.subject: information storage (en)
dc.subject: information retrieval (en)
dc.subject: Twitter (en)
dc.subject: CS5604 (en)
dc.subject: Digital Library Research Laboratory (en)
dc.subject: python (en)
dc.title: CS 5604 2020: Information Storage and Retrieval TWT - Tweet Collection Management Team (en)
dc.type: Presentation (en)
dc.type: Report (en)
dc.type: Other (en)

Files

Original bundle

Name: TWT_CS5604_F2020_Code.zip
Size: 30.62 MB

Name: TWT_CS5604_F2020_Presentation.pdf
Size: 4.14 MB
Format: Adobe Portable Document Format

Name: TWT_CS5604_F2020_Presentation.pptx
Size: 1.89 MB
Format: Microsoft Powerpoint XML

Name: TWT_CS5604_F2020_Report_Overleaf.zip
Size: 2.68 MB

Name: TWT_CS5604_F2020_Report.pdf
Size: 2.07 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission