NRV Tweets and RSS feeds

Abstract

The goal of this project was to associate existing data in the Virtual Town Square database from the New River Valley area with topical metadata. We took a database of approximately 360,000 tweets and 15,000 RSS news stories collected over the previous two years and associated each tweet and RSS story with topics. We used the open-source natural language processing library MALLET to perform topic modeling on the data with Latent Dirichlet Allocation (LDA), and then used the results to build a Solr instance of searchable tweets and news stories. Topic modeling was not done around specific events; instead, the entire set of tweets and the entire set of RSS stories each served as a corpus. The tweets were analyzed separately from the RSS stories, so the generated topics are specific to each dataset. The Methodology section of this report details our approach, and the report also contains a detailed Developer’s Guide and User’s Guide so that others may continue our work. The client was satisfied with the outcome of this project: although tweets have generally been considered too short for topic modeling, the topics we generated for each tweet appear to be relevant and accurate.
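
As a rough illustration of the last step of this pipeline, the sketch below turns per-document topic proportions (the kind of output an LDA run produces) into a flat, search-ready record. It is a minimal sketch under stated assumptions, not the project's code: the topic keywords, the proportions, the 0.3 threshold, the field names, and the helper functions `dominant_topics` and `to_index_record` are all hypothetical, and the code calls neither MALLET nor Solr.

```python
from typing import Dict, List

# Hypothetical topic keywords, standing in for the top words of each LDA topic.
# These are illustrative values, not topics produced by the project.
TOPIC_KEYWORDS: Dict[int, List[str]] = {
    0: ["football", "game", "stadium"],
    1: ["snow", "weather", "closing"],
    2: ["council", "town", "meeting"],
}


def dominant_topics(proportions: List[float], threshold: float = 0.3) -> List[int]:
    """Return ids of topics whose proportion meets the threshold, highest first.

    Falls back to the single most probable topic when none meets the
    threshold, so every document is associated with at least one topic.
    """
    ranked = sorted(range(len(proportions)),
                    key=lambda t: proportions[t], reverse=True)
    selected = [t for t in ranked if proportions[t] >= threshold]
    return selected or ranked[:1]


def to_index_record(doc_id: str, text: str,
                    proportions: List[float]) -> Dict[str, object]:
    """Build a flat record (assumed field names) suitable for a search index."""
    topics = dominant_topics(proportions)
    return {
        "id": doc_id,
        "text": text,
        "topic_ids": topics,
        "topic_keywords": [w for t in topics for w in TOPIC_KEYWORDS[t]],
    }


# Hypothetical tweet with made-up topic proportions.
record = to_index_record(
    "tweet-1",
    "Snow expected tonight, schools closing early",
    [0.05, 0.85, 0.10],
)
print(record)
```

Keeping the topic ids and keywords as plain list fields means a record like this could be posted to a search index as a multi-valued field, which is one plausible way to make tweets and stories searchable by topic.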

Description

This collection contains the source code, programs, documentation, and example data used in the project. Please review the "Final Report and Technical Manual" for a comprehensive overview of the project. The project used the open-source library MALLET, cited as: McCallum, Andrew Kachites. "MALLET: A Machine Learning for Language Toolkit." http://mallet.cs.umass.edu. 2002.

Keywords

nlp, natural language processing, lda, latent dirichlet allocation, mallet, open source, tweets, rss, nrv, new river valley, blacksburg, IDEAL

Citation