Tweet Collections

dc.contributor.author: Kolakaleti, Sushen
dc.contributor.author: D'Alessandro, Kevin
dc.contributor.author: Narantsatsralt, Enk
dc.contributor.author: Mruz, Ilya
dc.contributor.author: Lam, Chris
dc.date.accessioned: 2024-05-09T17:57:50Z
dc.date.available: 2024-05-09T17:57:50Z
dc.date.issued: 2024-05-09
dc.description.abstract: For a series of Virginia Tech research projects associated with Dr. Andrea Kavanaugh, more than six billion tweets from 2009 to 2024 were collected for research purposes. The tweets cover many topics but focus primarily on trends and important events of that period. They were collected in three formats: Social Feed Manager (SFM), yourTwapperKeeper (YTK), and Digital Methods Initiative Twitter Capture and Analysis Toolset (DMI-TCAT). The original goal of the project was to convert these tweets into a single format (JSON) to make tweet access easier and simplify the research process. The Fall 2021 team, consisting of Yash Bhargava, Daniel Burdisso, Pranav Dhakal, Anna Herms, and Kenneth Powell, was the first to take on this project and wrote the initial Python scripts that convert the three tweet formats to JSON. They provided six Python scripts, two for each of the three formats: one for the individual schema and one for the collection-level schema. However, large parts of these scripts were unoptimized and took an unreasonably long time to run. The Spring 2022 team, consisting of Matt Gonley, Ryan Nicholas, Nicole Fitz, Griffin Knock, and Derek Bruce, therefore optimized a portion of the original scripts and implemented a BERT-based machine learning model to classify the tweets. They adjusted the scripts to better accommodate scale and began the tweet conversion process, getting through about 800 million of the roughly six billion tweets collected. The project was taken up again in Spring 2024, beginning with additional automation scripts that simplify the SFM conversion process and reduce the amount of manual work it requires.
In addition to writing new scripts, our team updated some of the previous team's scripts to better suit our needs. We exported 45 collections from the SFM machine and converted 9,744,468 tweets from SFM. For DMI-TCAT and YTK, the raw SQL files needed to be transferred to a new database before the remaining tweets could be converted. This transfer was begun at the Digital Library Research Laboratory, located in room 2030 of Torgersen Hall, and will continue into Summer 2024. On the machine learning side, we implemented a new hate speech classifier, motivated by the prevalence of hate speech on the internet. We tested both a GloVe model and a BERT model with a Naive Bayes classifier and ultimately settled on the GloVe model, which was significantly faster while still providing enough accuracy to be useful.
dc.description.sponsorship: Andrea Kavanaugh (Associate Director, Center for Human/Computer Interaction, kavan@vt.edu), Mohamed Magdy Farag (Research Associate, VTTI-Sustainable Mobility, mmagdy@vt.edu), Satvik Chekuri (Ph.D. student, GTA, satvikchekuri@vt.edu).
dc.identifier.uri: https://hdl.handle.net/10919/118936
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: CC0 1.0 Universal
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/
dc.subject: tweet
dc.subject: collection
dc.subject: JSON
dc.subject: Python
dc.subject: Machine Learning
dc.subject: Classification
dc.subject: Data Conversion
dc.subject: Data Processing
dc.title: Tweet Collections
dc.type: Report
dc.type: Presentation
dc.type: Software
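
The abstract describes converting tweets from three collection formats into a single JSON format. As a rough illustration of that idea only, the sketch below maps two differently shaped inputs onto one common record and serializes it. The unified key names, the YTK-style column names, and both helper functions are assumptions made for this example; they are not taken from the project's scripts or schemas.

```python
import json
from typing import Any, Dict

# NOTE: every key and column name below is hypothetical. The project's real
# individual and collection-level schemas are defined in its own scripts.


def from_ytk_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Map a YTK-style database row (assumed columns) to the common record."""
    return {
        "id": str(row["id"]),
        "text": row["text"],
        "created_at": row["created_at"],
        "user": row["from_user"],
        "source_format": "YTK",
    }


def from_sfm_tweet(tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Map an SFM-style tweet object (assumed fields) to the common record."""
    return {
        "id": tweet["id_str"],
        # Prefer the untruncated text when the input carries it.
        "text": tweet.get("full_text") or tweet.get("text", ""),
        "created_at": tweet["created_at"],
        "user": tweet["user"]["screen_name"],
        "source_format": "SFM",
    }


def to_json_line(record: Dict[str, Any]) -> str:
    """Serialize one common record as a single line of JSON."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

Because both helpers return the same set of keys, downstream code can read the output without knowing which tool originally collected a tweet.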

Files

Original bundle
Name: TweetCollectionScripts.zip
Size: 78.8 MB
Format:
Description: Code to convert individual- and collection-level tweets in the three formats (YTK, DMI, and SFM). Each is in its own folder, organized by the source of the data. Also contains a requirements.txt and the Events Archive spreadsheet.
Name: TweetCollectionPresentation.pdf
Size: 2.37 MB
Format: Adobe Portable Document Format
Description: PDF version of the final presentation.
Name: TweetCollectionPresentation.pptx
Size: 2.59 MB
Format: Microsoft Powerpoint XML
Description: PowerPoint version of the final presentation.
Name: TweetCollectionReport.pdf
Size: 1.05 MB
Format: Adobe Portable Document Format
Description: PDF version of the final report.
Name: TweetCollectionReport.docx
Size: 1.95 MB
Format: Microsoft Word XML
Description: Word version of the final report.
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission
Description: