Expressive Forms of Topic Modeling to Support Digital Humanities

Date
2014-10-15
Publisher
Virginia Tech
Abstract

Unstructured textual data is growing rapidly, and practitioners from diverse disciplines increasingly need to structure this massive amount of data. Topic modeling is one of the most widely used techniques for analyzing and understanding the latent structure of large text collections. Probabilistic graphical models are the main building block behind topic modeling, and they are used to express assumptions about the latent structure of complex data. This dissertation addresses four problems related to drawing structure from high-dimensional data and improving the text mining process.

Studying the ebb and flow of ideas during critical events, e.g., an epidemic, is very important to understanding the reporting or coverage around the event and the impact of the event on society. This can be accomplished by capturing the dynamic evolution of the topics underlying a text corpus. We approach this problem by identifying segment boundaries that mark significant shifts in topic coverage. To identify these boundaries, we embed a temporal segmentation algorithm around a topic modeling algorithm. A key advantage of our approach is that it integrates with existing topic modeling algorithms in a transparent manner; thus, more sophisticated algorithms can be readily plugged in as research in topic modeling evolves. We apply this algorithm to data from the iNeighbors system, covering six neighborhoods (three economically advantaged and three economically disadvantaged), to test whether differences in their conversations are statistically significant. Our findings suggest that social technologies may afford opportunities for democratic engagement in contexts that are otherwise less likely to support opportunities for deliberation and participatory democracy. We also examine the progression in coverage of the 1918 influenza epidemic in historical newspapers by applying our algorithm to the Washington Times archives. The algorithm is successful in identifying important qualitative features of news coverage of the pandemic.

Visually convincing presentation of the results of data mining algorithms and models is crucial to analyzing those results and drawing conclusions from them. We develop ThemeDelta, a visual analytics system for extracting and visualizing temporal trends, clustering, and reorganization in time-indexed textual datasets. ThemeDelta is supported by a dynamic temporal segmentation algorithm that integrates with topic modeling algorithms to identify change points where significant shifts in topics occur. This algorithm detects not only the clustering and associations of keywords in a time period, but also their convergence into topics (groups of keywords) that may later diverge into new groups. The visual representation of ThemeDelta uses sinuous, variable-width lines to show this evolution on a timeline, using color for categories and line width for keyword strength. We demonstrate how interaction with ThemeDelta helps capture the rise and fall of topics by analyzing archives of historical newspapers, of U.S. presidential campaign speeches, and of social messages collected through iNeighbors. ThemeDelta is evaluated in a qualitative expert user study in which three researchers from rhetoric and history worked with the historical newspapers corpus.

Time and location are key parameters in any event; neglecting them while discovering topics from a collection of documents results in missing valuable information. We propose a dynamic spatial topic model (DSTM), a true spatio-temporal model that enables disaggregating a corpus's coverage into location-based reporting, and understanding how such coverage varies over time. DSTM naturally generalizes traditional spatial and temporal topic models so that many existing formalisms can be viewed as special cases of DSTM. We demonstrate a successful application of DSTM to multiple newspapers from the Chronicling America repository. We demonstrate how our approach helps uncover key differences in the coverage of the flu as it spread through the nation, and provide possible explanations for such differences.

Major events that can change the flow of people's lives are important to predict, especially when we have powerful models and sufficient data available at our fingertips. The last part of this dissertation addresses the problem of embedding the DSTM in a predictive setting. To predict events and their locations across time, we present a predictive dynamic spatial topic model that can predict future topics and their locations from unseen documents. We show the applicability of our proposed approach by applying it to streaming tweets from Latin America. The prediction approach was successful in identifying major events and their locations.

Keywords
Topic Modeling, LDA, Segmentation