Automated Vocabulary Building for Characterizing and Forecasting Elections using Social Media Analytics
Twitter has become a popular data source in the recent decade and garnered a significant amount of attention as a surrogate data source for many important forecasting problems. Strong correlations have been observed between Twitter indicators and real-world trends spanning elections, stock markets, book sales, and flu outbreaks. A key ingredient to all methods that use Twitter for forecasting is to agree on a domain-specific vocabulary to track the pertinent tweets, which is typically provided by subject matter experts (SMEs). The language used in Twitter drastically differs from other forms of online discourse, such as news articles and blogs. It constantly evolves over time as users adopt popular hashtags to express their opinions. Thus, the vocabulary used by forecasting algorithms needs to be dynamic in nature and should capture emerging trends over time. This thesis proposes a novel unsupervised learning algorithm that builds a dynamic vocabulary using Probabilistic Soft Logic (PSL), a framework for probabilistic reasoning over relational domains. Using eight presidential elections from Latin America, we show how our query expansion methodology improves the performance of traditional election forecasting algorithms. Through this approach we demonstrate how we can achieve close to a two-fold increase in the number of tweets retrieved for predictions and a 36.90% reduction in prediction error.