Show simple item record

dc.contributor.authorChen, Fengen_US
dc.date.accessioned2013-02-19T22:38:39Z
dc.date.available2013-02-19T22:38:39Z
dc.date.issued2013-01-21en_US
dc.identifier.othervt_gsexam:203en_US
dc.identifier.urihttp://hdl.handle.net/10919/19220
dc.description.abstractKnowledge discovery on spatio-temporal datasets has attracted
growing interests. Recent advances on remote sensing technology mean
that massive amounts of spatio-temporal data are being collected,
and its volume keeps increasing at an ever faster pace. It becomes
critical to design efficient algorithms for identifying novel and
meaningful patterns from massive spatio-temporal datasets. Different
from the other data sources, this data exhibits significant
space-time statistical dependence, and the assumption of i.i.d. is
no longer valid. The exact modeling of space-time dependence will
render the exponential growth of model complexity as the data size
increases. This research focuses on the construction of efficient
and effective approaches using approximate inference techniques for
three main mining tasks, including spatial outlier detection, robust
spatio-temporal prediction, and novel applications to real world
problems.

Spatial novelty patterns, or spatial outliers, are those data points
whose characteristics are markedly different from their spatial
neighbors. There are two major branches of spatial outlier detection
methodologies, which can be either global Kriging based or local
Laplacian smoothing based. The former approach requires the exact
modeling of spatial dependence, which is time extensive; and the
latter approach requires the i.i.d. assumption of the smoothed
observations, which is not statistically solid. These two approaches
are constrained to numerical data, but in real world applications we
are often faced with a variety of non-numerical data types, such as
count, binary, nominal, and ordinal. To summarize, the main research
challenges are: 1) how much spatial dependence can be eliminated via
Laplace smoothing; 2) how to effectively and efficiently detect
outliers for large numerical spatial datasets; 3) how to generalize
numerical detection methods and develop a unified outlier detection
framework suitable for large non-numerical datasets; 4) how to
achieve accurate spatial prediction even when the training data has
been contaminated by outliers; 5) how to deal with spatio-temporal
data for the preceding problems.

To address the first and second challenges, we mathematically
validated the effectiveness of Laplacian smoothing on the
elimination of spatial autocorrelations. This work provides
fundamental support for existing Laplacian smoothing based methods.
We also discovered a nontrivial side-effect of Laplacian smoothing,
which ingests additional spatial variations to the data due to
convolution effects. To capture this extra variability, we proposed
a generalized local statistical model, and designed two fast forward
and backward outlier detection methods that achieve a better balance
between computational efficiency and accuracy than most existing
methods, and are well suited to large numerical spatial datasets.

We addressed the third challenge by mapping non-numerical variables
to latent numerical variables via a link function, such as logit
function used in logistic regression, and then utilizing
error-buffer artificial variables, which follow a Student-t
distribution, to capture the large valuations caused by outliers. We
proposed a unified statistical framework, which integrates the
advantages of spatial generalized linear mixed model, robust spatial
linear model, reduced-rank dimension reduction, and Bayesian
hierarchical model. A linear-time approximate inference algorithm
was designed to infer the posterior distribution of the error-buffer
artificial variables conditioned on observations. We demonstrated
that traditional numerical outlier detection methods can be directly
applied to the estimated artificial variables for outliers
detection. To the best of our knowledge, this is the first
linear-time outlier detection algorithm that supports a variety of
spatial attribute types, such as binary, count, ordinal, and
nominal.

To address the fourth and fifth challenges, we proposed a robust
version of the Spatio-Temporal Random Effects (STRE) model, namely
the Robust STRE (R-STRE) model. The regular STRE model is a recently
proposed statistical model for large spatio-temporal data that has a
linear order time complexity, but is not best suited for
non-Gaussian and contaminated datasets. This deficiency can be
systemically addressed by increasing the robustness of the model
using heavy-tailed distributions, such as the Huber, Laplace, or
Student-t distribution to model the measurement error, instead of
the traditional Gaussian. However, the resulting R-STRE model
becomes analytical intractable, and direct application of
approximate inferences techniques still has a cubic order time
complexity. To address the computational challenge, we reformulated
the prediction problem as a maximum a posterior (MAP) problem with a
non-smooth objection function, transformed it to a equivalent
quadratic programming problem, and developed an efficient
interior-point numerical algorithm with a near linear order
complexity. This work presents the first near linear time robust
prediction approach for large spatio-temporal datasets in both
offline and online cases.
en_US
dc.format.mediumETDen_US
dc.publisherVirginia Techen_US
dc.rightsThis Item is protected by copyright and/or related rights. Some uses of this Item may be deemed fair and permitted by law even without permission from the rights holder(s), or the rights holder(s) may have licensed the work for use under certain conditions. For other uses you need to obtain permission from the rights holder(s).en_US
dc.subjectSpatio-Temporal Analysisen_US
dc.subjectOutlier Detectionen_US
dc.subjectRobust Predictionen_US
dc.subjectEnergy Disaggregationen_US
dc.titleEfficient Algorithms for Mining Large Spatio-Temporal Dataen_US
dc.typeDissertationen_US
dc.contributor.departmentComputer Scienceen_US
dc.description.degreePh. D.en_US
thesis.degree.namePh. D.en_US
thesis.degree.leveldoctoralen_US
thesis.degree.grantorVirginia Polytechnic Institute and State Universityen_US
thesis.degree.disciplineComputer Science and Applicationsen_US
dc.contributor.committeechairLu, Chang-Tienen_US
dc.contributor.committeememberRamakrishnan, Narendranen_US
dc.contributor.committeememberWang, Yue J.en_US
dc.contributor.committeememberLou, Wenjingen_US
dc.contributor.committeememberChen, Ing Rayen_US


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record