Segmenting, Summarizing and Predicting Data Sequences

Chen, Liangzhe

dc.contributor.author	Chen, Liangzhe	en
dc.contributor.committeechair	Prakash, B. Aditya	en
dc.contributor.committeemember	Liu, Yan	en
dc.contributor.committeemember	Ramakrishnan, Naren	en
dc.contributor.committeemember	Lu, Chang-Tien	en
dc.contributor.committeemember	Fox, Edward A.	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2018-06-20T08:02:14Z	en
dc.date.available	2018-06-20T08:02:14Z	en
dc.date.issued	2018-06-19	en
dc.description.abstract	Temporal data is ubiquitous nowadays and can be easily found in many applications. Consider the extensively studied social media website Twitter. All the information can be associated with time stamps, and thus form different types of data sequences: a sequence of feature values of users who retweet a message, a sequence of tweets from a certain user, or a sequence of the evolving friendship networks. Mining these data sequences is an important task, which reveals patterns in the sequences, and it is a very challenging task as it usually requires different techniques for different sequences. The problem becomes even more complicated when the sequences are correlated. In this dissertation, we study the following two types of data sequences, and we show how to carefully exploit within-sequence and across-sequence correlations to develop more effective and scalable algorithms. 1. Multi-dimensional value sequences: We study sequences of multi-dimensional values, where each value is associated with a time stamp. Such value sequences arise in many domains such as epidemiology (medical records), social media (keyword trends), etc. Our goals are: for individual sequences, to find a segmentation of the sequence to capture where the pattern changes; for multiple correlated sequences, to use the correlations between sequences to further improve our segmentation; and to automatically find explanations of the segmentation results. 2. Social media post sequences: Driven by applications from popular social media websites such as Twitter and Weibo, we study the modeling of social media post sequences. Our goal is to understand how the posts (like tweets) are generated and how we can gain understanding of the users behind these posts. For individual social media post sequences, we study a prediction problem to find the users' latent state changes over the sequence. 
For dependent post sequences, we analyze the social influence among users, and how it affects users in generating posts and links. Our models and algorithms lead to useful discoveries, and they solve real problems in Epidemiology, Social Media and Critical Infrastructure Systems. Further, most of the algorithms and frameworks we propose can be extended to solve sequence mining problems in other domains as well.	en
dc.description.abstractgeneral	Temporal data is ubiquitous nowadays and can be easily found in many applications. Consider the extensively studied social media website Twitter. All the information can be associated with time stamps, and thus form different types of data sequences: a sequence of feature values of users who retweet a message, a sequence of tweets from a certain user, or a sequence of the evolving friendship networks. Mining these data sequences is an important task, which reveals patterns in the sequences, and helps downstream tasks like data compression and visualization. At the same time, it is a very challenging task as it usually requires different techniques for different sequences. The problem becomes even more complicated when the sequences are correlated. In this dissertation, we first study value sequences, where objects in the sequence are multidimensional data values, and move to text sequences, where each object in the sequence is a text document (like a tweet). For each of these data sequences, we study them either as independent individual sequences, or as a group of dependent sequences. We then show how to carefully exploit different types of correlations behind the sequences to develop more effective and scalable algorithms. Our models and algorithms lead to useful discoveries, and they solve real problems in Epidemiology, Social Media and Critical Infrastructure Systems. Further, most of the algorithms and frameworks we propose can be extended to solve sequence mining problems in other domains as well.	en
dc.description.degree	Ph. D.	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:14751	en
dc.identifier.uri	http://hdl.handle.net/10919/83573	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Sequence Mining	en
dc.subject	Segmentation	en
dc.subject	Topic Modeling	en
dc.title	Segmenting, Summarizing and Predicting Data Sequences	en
dc.type	Dissertation	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Ph. D.	en

Files

Original bundle

Name: Chen_L_D_2018.pdf
Size: 7.54 MB
Format: Adobe Portable Document Format

Collections

Doctoral Dissertations