Tweet Collections

Abstract

Over the past decade, social media use has grown rapidly. More and more people use social networks to connect and communicate with one another, which has created a new source of data and, with it, a new field of study: social media analysis. Since Twitter is one of the largest platforms for text-based user input, many tools have been created to analyze data from this social network.

The TweetCollections project is designed to analyze large amounts of tweet collection metadata and provide additional information that makes the tweet collections easy to categorize and study. Our clients, Liuqing Li and Ziqian Song, provided our team with a set of tweet collections and asked us to assign metadata to them so that future researchers can easily find relevant collections. This metadata includes tags and categories, as well as a description with an accompanying source. Formerly, this process was done by hand. While manual entry improves the accuracy of the data collected, it is too expensive and time-consuming to maintain. Our team was tasked with speeding up the process by using scripts to find information for these fields and fill them in.

Our approach relies mainly on Python and its libraries. Python makes it easy to parse the tweet collection data by treating the input as an Excel file, and to pull other relevant information from third-party sources such as Wikipedia. The driver creates a new, updated Excel file with the additional data, categories, and tags. Additionally, an ontology will be produced to serve as a reference for categorizing topics listed in the fields from the input.

The GETAR team has created over 1,400 tweet collections, containing over two billion tweets. To help categorize this data, the team also stores metadata about these collections in a comma-separated values (CSV) file. This project results in a product that takes as input a CSV file of the tweet collection metadata archive, with the required fields (such as “Keyword” and/or “Date”) filled in, and produces a separate CSV file as output with the missing fields filled in. The overarching problem is that each category term is rather vague, so more information must be derived from it.

The completed project contains three Python scripts: csv_parser.py, search_wikipedia.py, and GUI.py. Together, these form a program that takes an input CSV file and an integer range specifying which lines to process, and returns a new CSV file with the additional metadata filled in. Also included with the deliverable are a populated Excel file, with over 150 additional entries of metadata, and an error file containing recommendations for the ontology. These recommendations are generated from any results that our driver judges to be ‘low relevance’, and they offer alternative options with a higher term frequency.
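The input/output behavior described above can be illustrated with a minimal sketch. It is not the project's csv_parser.py. The function names `fill_missing` and `process_csv`, the "Description" column name, and the `lookup` callable (a stand-in for whatever the Wikipedia search returns) are all assumptions made for illustration. Only the "Keyword" and "Date" field names come from the text.

```python
import csv
import io


def fill_missing(rows, start, end, lookup):
    """Fill empty 'Description' cells for rows whose index is in [start, end).

    `lookup` is a hypothetical callable that maps a keyword to a description;
    in the real project this role would be played by a Wikipedia search.
    """
    filled = []
    for index, row in enumerate(rows):
        row = dict(row)  # copy so the caller's rows are left untouched
        description = (row.get("Description") or "").strip()
        if start <= index < end and not description:
            row["Description"] = lookup(row.get("Keyword") or "")
        filled.append(row)
    return filled


def process_csv(csv_text, start, end, lookup):
    """Read CSV text, fill the selected rows, and return new CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames
    rows = fill_missing(list(reader), start, end, lookup)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = (
        "Keyword,Date,Description\n"
        "hurricane,2017-09-01,\n"
        "earthquake,2015-04-25,existing\n"
        "flood,2016-08-12,\n"
    )
    # Process only the first two data rows; the third is left as-is.
    print(process_csv(sample, 0, 2, lambda keyword: "about " + keyword))
```

Rows outside the requested range, and rows whose field is already populated, pass through unchanged, which mirrors the idea of filling in only the missing metadata.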

Keywords
Tweet Collections, CSV Parsing, GETAR