Tweet Collections

Kolakaleti, Sushen; D'Alessandro, Kevin; Narantsatsralt, Enk; Mruz, Ilya; Lam, Chris

dc.contributor.author	Kolakaleti, Sushen	en
dc.contributor.author	D'Alessandro, Kevin	en
dc.contributor.author	Narantsatsralt, Enk	en
dc.contributor.author	Mruz, Ilya	en
dc.contributor.author	Lam, Chris	en
dc.date.accessioned	2024-05-09T17:57:50Z	en
dc.date.available	2024-05-09T17:57:50Z	en
dc.date.issued	2024-05-09	en
dc.description.abstract	For several Virginia Tech research projects associated with Dr. Andrea Kavanaugh, more than six billion tweets from 2009 to 2024 were collected for research purposes. The tweets cover many topics but focus primarily on trends and important events from that period. They were collected in three different formats: Social Feed Manager (SFM), yourTwapperKeeper (YTK), and Digital Methods Initiative Twitter Capture and Analysis Toolset (DMI-TCAT). The original goal of the project was to convert these tweets into a single format (JSON) to make tweet access easier and simplify the research process. The Fall 2021 team, consisting of Yash Bhargava, Daniel Burdisso, Pranav Dhakal, Anna Herms, and Kenneth Powell, was the first to take on this project and wrote the initial Python scripts that convert the three tweet formats to JSON. They provided six Python scripts, two for each of the three tweet formats: one for the individual schema and one for the collection-level schema. However, large parts of these scripts were unoptimized and took an unreasonably long time to run. The Spring 2022 team, consisting of Matt Gonley, Ryan Nicholas, Nicole Fitz, Griffin Knock, and Derek Bruce, then took on the project, optimized a portion of the original scripts, and implemented a BERT-based machine learning model to classify the tweets. They adjusted the scripts to better accommodate scale and began the tweet conversion process, getting through about 800 million of the roughly six billion tweets collected. Our team took over the project in Spring 2024 and began by writing additional automation scripts to simplify the SFM conversion process and reduce the amount of manual work it required.
In addition to writing new scripts, our team updated some of the previous team's scripts to better suit our needs. We exported 45 collections from the SFM machine and converted 9,744,468 tweets from SFM. For DMI-TCAT and YTK, the raw SQL files had to be transferred to a new database before the remaining tweets could be converted. This transfer was begun at the Digital Library Research Laboratory, located in room 2030 of Torgersen Hall, and will continue into Summer 2024. On the machine learning side, we implemented a new hate speech classifier, motivated by the prevalence of hate speech on the internet. We tested both a GloVe model and a BERT model with a Naive Bayes classifier and ultimately chose the GloVe model because it was significantly faster while still accurate enough to be useful.	en
dc.description.sponsorship	Andrea Kavanaugh (Associate Director, Center for Human/Computer Interaction, kavan@vt.edu), Mohamed Magdy Farag (Research Associate, VTTI-Sustainable Mobility, mmagdy@vt.edu), Satvik Chekuri (Ph.D. student, GTA, satvikchekuri@vt.edu).	en
dc.identifier.uri	https://hdl.handle.net/10919/118936	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	CC0 1.0 Universal	en
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	en
dc.subject	tweet	en
dc.subject	collection	en
dc.subject	JSON	en
dc.subject	Python	en
dc.subject	Machine Learning	en
dc.subject	Classification	en
dc.subject	Data Conversion	en
dc.subject	Data Processing	en
dc.title	Tweet Collections	en
dc.type	Report	en
dc.type	Presentation	en
dc.type	Software	en

Files

Original bundle

Name: TweetCollectionScripts.zip
Size: 78.8 MB
Format:
Description: Code to convert individual- and collection-level tweets from the three source formats (YTK, DMI, and SFM). The code for each format is in its own folder, corresponding to the source of the data. Also contains a requirements.txt and the Events Archive spreadsheet.