Efficient Algorithms for Mining Large Spatio-Temporal Data

Chen, Feng

Efficient Algorithms for Mining Large Spatio-Temporal Data

dc.contributor.author	Chen, Feng	en
dc.contributor.committeechair	Lu, Chang-Tien	en
dc.contributor.committeemember	Ramakrishnan, Naren	en
dc.contributor.committeemember	Wang, Yue J.	en
dc.contributor.committeemember	Lou, Wenjing	en
dc.contributor.committeemember	Chen, Ing-Ray	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2013-02-19T22:38:39Z	en
dc.date.available	2013-02-19T22:38:39Z	en
dc.date.issued	2013-01-21	en
dc.description.abstract	Knowledge discovery on spatio-temporal datasets has attracted<br />growing interests. Recent advances on remote sensing technology mean<br />that massive amounts of spatio-temporal data are being collected,<br />and its volume keeps increasing at an ever faster pace. It becomes<br />critical to design efficient algorithms for identifying novel and<br />meaningful patterns from massive spatio-temporal datasets. Different<br />from the other data sources, this data exhibits significant<br />space-time statistical dependence, and the assumption of i.i.d. is<br />no longer valid. The exact modeling of space-time dependence will<br />render the exponential growth of model complexity as the data size<br />increases. This research focuses on the construction of efficient<br />and effective approaches using approximate inference techniques for<br />three main mining tasks, including spatial outlier detection, robust<br />spatio-temporal prediction, and novel applications to real world<br />problems.<br /><br />Spatial novelty patterns, or spatial outliers, are those data points<br />whose characteristics are markedly different from their spatial<br />neighbors. There are two major branches of spatial outlier detection<br />methodologies, which can be either global Kriging based or local<br />Laplacian smoothing based. The former approach requires the exact<br />modeling of spatial dependence, which is time extensive; and the<br />latter approach requires the i.i.d. assumption of the smoothed<br />observations, which is not statistically solid. These two approaches<br />are constrained to numerical data, but in real world applications we<br />are often faced with a variety of non-numerical data types, such as<br />count, binary, nominal, and ordinal. To summarize, the main research<br />challenges are: 1) how much spatial dependence can be eliminated via<br />Laplace smoothing; 2) how to effectively and efficiently detect<br />outliers for large numerical spatial datasets; 3) how to generalize<br />numerical detection methods and develop a unified outlier detection<br />framework suitable for large non-numerical datasets; 4) how to<br />achieve accurate spatial prediction even when the training data has<br />been contaminated by outliers; 5) how to deal with spatio-temporal<br />data for the preceding problems.<br /><br />To address the first and second challenges, we mathematically<br />validated the effectiveness of Laplacian smoothing on the<br />elimination of spatial autocorrelations. This work provides<br />fundamental support for existing Laplacian smoothing based methods.<br />We also discovered a nontrivial side-effect of Laplacian smoothing,<br />which ingests additional spatial variations to the data due to<br />convolution effects. To capture this extra variability, we proposed<br />a generalized local statistical model, and designed two fast forward<br />and backward outlier detection methods that achieve a better balance<br />between computational efficiency and accuracy than most existing<br />methods, and are well suited to large numerical spatial datasets.<br /><br />We addressed the third challenge by mapping non-numerical variables<br />to latent numerical variables via a link function, such as logit<br />function used in logistic regression, and then utilizing<br />error-buffer artificial variables, which follow a Student-t<br />distribution, to capture the large valuations caused by outliers. We<br />proposed a unified statistical framework, which integrates the<br />advantages of spatial generalized linear mixed model, robust spatial<br />linear model, reduced-rank dimension reduction, and Bayesian<br />hierarchical model. A linear-time approximate inference algorithm<br />was designed to infer the posterior distribution of the error-buffer<br />artificial variables conditioned on observations. We demonstrated<br />that traditional numerical outlier detection methods can be directly<br />applied to the estimated artificial variables for outliers<br />detection. To the best of our knowledge, this is the first<br />linear-time outlier detection algorithm that supports a variety of<br />spatial attribute types, such as binary, count, ordinal, and<br />nominal.<br /><br />To address the fourth and fifth challenges, we proposed a robust<br />version of the Spatio-Temporal Random Effects (STRE) model, namely<br />the Robust STRE (R-STRE) model. The regular STRE model is a recently<br />proposed statistical model for large spatio-temporal data that has a<br />linear order time complexity, but is not best suited for<br />non-Gaussian and contaminated datasets. This deficiency can be<br />systemically addressed by increasing the robustness of the model<br />using heavy-tailed distributions, such as the Huber, Laplace, or<br />Student-t distribution to model the measurement error, instead of<br />the traditional Gaussian. However, the resulting R-STRE model<br />becomes analytical intractable, and direct application of<br />approximate inferences techniques still has a cubic order time<br />complexity. To address the computational challenge, we reformulated<br />the prediction problem as a maximum a posterior (MAP) problem with a<br />non-smooth objection function, transformed it to a equivalent<br />quadratic programming problem, and developed an efficient<br />interior-point numerical algorithm with a near linear order<br />complexity. This work presents the first near linear time robust<br />prediction approach for large spatio-temporal datasets in both<br />offline and online cases.	en
dc.description.degree	Ph. D.	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:203	en
dc.identifier.uri	http://hdl.handle.net/10919/19220	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Spatio-Temporal Analysis	en
dc.subject	Outlier Detection	en
dc.subject	Robust Prediction	en
dc.subject	Energy Disaggregation	en
dc.title	Efficient Algorithms for Mining Large Spatio-Temporal Data	en
dc.type	Dissertation	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Ph. D.	en

Files

Original bundle

Now showing 1 - 1 of 1

Name:: Chen_F_D_2013.pdf
Size:: 2.45 MB
Format:: Adobe Portable Document Format

Download

Collections

Doctoral Dissertations