Twitter Collections

Abstract

TwitterCollections continues the work of a previous semester's team, Library6Btweets. That team, which worked during Fall 2021, was composed of Yash Bhargava, Daniel Burdisso, Pranav Dhakal, Anna Herms, and Kenneth Powell. The current team, which took over the project in Spring 2022, is composed of Matt Gonley, Ryan Nicholas, Nicole Fitz, Griffin Knock, and Derek Bruce. The Digital Library Research Laboratory (DLRL) has collected billions of tweets in three formats: Social Feed Manager (SFM), yourTwapperKeeper (YTK), and Digital Methods Initiative Twitter Capture and Analysis Toolset (DMI-TCAT). To allow for ease of access and data research, the collected tweets should be converted into a standard data format. The primary goal is to convert them into a unified JSON format; a secondary goal is to create a machine learning model to categorize uncategorized tweets. The standardized format has two levels: an individual level and a collection level. The individual level requires converting each tweet and its attributes to a JSON object, while the collection level requires converting a whole collection of tweets to a separate JSON object. Our work began with familiarizing ourselves with the previous semester's work and its schema. The previous team designed the schema with the three tweet formats in mind, as well as the Twitter version 2 schema. They also created a collection-level schema that lists all of the tweet IDs in a given collection, making it possible to determine which tweets belong to which collection; they designed it in accordance with the events archive website. We were also given the previous team's conversion scripts for each of the tweet formats. Each format needed a different script, because the attributes and metadata collected differed from format to format.
The storage format also differed: DMI-TCAT split the data for any given topic into six SQL tables, YTK kept the data for a topic in separate tables, and SFM stored it as JSON. The original scripts were written in Python, and for simplicity we continued using Python. We were provided six scripts, two for each tweet format: one for the individual-level schema and one for the collection-level schema. Our focus was on optimizing these scripts, as some of them were unusably slow, and on modifying them to work at scale, where the data cannot all be loaded into memory at once. In addition to these optimizations and modifications, we created a machine learning model to accurately classify the events for unlabeled tweet collections. The model can classify tweets when fed data from any of the formats. We experimented with a Naive Bayes model and a BERT-based neural network model, and found the latter superior. Our deliverables for this semester are the new scripts, optimized versions of the prior scripts, the best machine learning model, and the converted Twitter collection JSON files. We hope that a standardized set of data will allow fast and effective research for those who want to incorporate tweets into their studies.
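To make the two schema levels and the memory constraint concrete, the following is a minimal Python sketch, not the team's actual scripts. It assumes SFM-style input with one JSON tweet per line. The output field names (`id`, `text`, `created_at`, `source_format`, `collection`, `tweet_ids`) and the function names are hypothetical and do not reproduce the project's real schema. Tweets are read and written one at a time, so only the list of tweet IDs is held in memory.

```python
import io
import json
from typing import Dict, Iterable, Iterator, List, TextIO


def to_individual(raw: Dict, source_format: str) -> Dict:
    """Map one raw tweet to a hypothetical individual-level record."""
    return {
        "id": str(raw.get("id_str") or raw.get("id")),
        "text": raw.get("full_text") or raw.get("text", ""),
        "created_at": raw.get("created_at"),
        "source_format": source_format,
    }


def stream_json_lines(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed tweet at a time instead of loading the whole input."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def convert(lines: Iterable[str], source_format: str,
            collection_name: str, out: TextIO) -> Dict:
    """Write individual-level records to `out`, one JSON object per line,
    and return a hypothetical collection-level record listing the tweet IDs."""
    tweet_ids: List[str] = []
    for raw in stream_json_lines(lines):
        record = to_individual(raw, source_format)
        out.write(json.dumps(record) + "\n")
        tweet_ids.append(record["id"])
    return {"collection": collection_name, "tweet_ids": tweet_ids}


if __name__ == "__main__":
    sample = ['{"id_str": "1", "text": "hello"}',
              '{"id": 2, "full_text": "world"}']
    buffer = io.StringIO()
    print(convert(sample, "SFM", "demo", buffer))
    print(buffer.getvalue(), end="")
```

Because `convert` accepts any iterable of lines, an open file handle can be passed directly, and the same loop could be fed from rows fetched in batches from the SQL tables used by DMI-TCAT and YTK, given a format-specific mapping function in place of `to_individual`.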

Description

Keywords

Twitter, SFM, DMI-TCAT, YTK, JSON, Python

Citation