Probabilistic Modeling of Multi-relational and Multivariate Discrete Data
Abstract
Modeling and discovering knowledge from multi-relational and multivariate
discrete data is a crucial task that arises in many research and application
domains, e.g. text mining, intelligence analysis, epidemiology, and social
science. In this dissertation, we study and address three problems involving the
modeling of multi-relational discrete data and multivariate multi-response count
data, viz. (1) discovering surprising patterns from multi-relational data, (2)
constructing a generative model for multivariate categorical data, and (3)
simultaneously modeling multivariate multi-response count data and estimating
covariance structures between multiple responses.
To discover surprising multi-relational patterns, we first study the "where
do I start?" problem originating from intelligence analysis. By studying nine
methods with origins in association analysis, graph metrics, and probabilistic
modeling, we identify several classes of algorithmic strategies that can supply
starting points to analysts, and thus help to discover interesting
multi-relational patterns from datasets. To actually mine for interesting
multi-relational patterns, we represent the multi-relational patterns as dense
and well-connected chains of biclusters over multiple relations, and model the
discrete data by the maximum entropy principle, such that in a statistically
well-founded way we can gauge the surprisingness of a discovered bicluster chain
with respect to what we already know. We design an algorithm for approximating
the most informative multi-relational patterns, and provide strategies to
incrementally organize discovered patterns into the background model. We
illustrate how our method is adept at discovering the hidden plot in multiple
synthetic and real-world intelligence analysis datasets. Our approach naturally
generalizes traditional attribute-based maximum entropy models for single
relations, and further supports iterative, human-in-the-loop, knowledge
discovery.
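The idea of gauging surprisingness against a background model can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's method: the function name `bicluster_information` is hypothetical, a single binary relation is used instead of a chain over multiple relations, and a crude rank-one independence model built from row and column densities stands in for the fitted maximum entropy background model. The score is the self-information (in bits) of observing all ones in the chosen cells, so less expected biclusters score higher.

```python
import numpy as np

def bicluster_information(data, rows, cols):
    """Self-information (bits) of an all-ones bicluster under a crude background model.

    Assumption: cell (i, j) is 1 with probability row_density[i] * col_density[j]
    / overall_density, clipped to (0, 1]. This is only a stand-in for a fitted
    maximum entropy model with row and column margin constraints.
    """
    data = np.asarray(data, dtype=float)
    row_density = data.mean(axis=1)
    col_density = data.mean(axis=0)
    overall_density = data.mean()
    background = np.clip(
        np.outer(row_density, col_density) / overall_density, 1e-9, 1.0
    )
    # Sum of -log2 p over the cells covered by the bicluster.
    return float(-np.log2(background[np.ix_(rows, cols)]).sum())

# Toy binary relation: rows could be documents, columns entities.
relation = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])
block_score = bicluster_information(relation, [0, 1], [0, 1])
cell_score = bicluster_information(relation, [0], [0])
```

In an iterative setting, a discovered pattern would be added to the background model as a new constraint, so that reporting it again would no longer be informative; the sketch above omits that update step.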
To build a generative model for multivariate categorical data, we apply the
maximum entropy principle to propose a categorical maximum entropy model such
that in a statistically well-founded way we can optimally use given prior
information about the data, and are unbiased otherwise. In general, inferring a
maximum entropy model can be computationally infeasible in practice. Here, we leverage the
structure of the categorical data space to design an efficient model inference
algorithm to estimate the categorical maximum entropy model, and we demonstrate
how the proposed model is adept at estimating underlying data distributions. We
evaluate this approach against both simulated data and US census datasets, and
demonstrate its feasibility using an epidemic simulation application.
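To make the maximum entropy principle concrete, the sketch below fits a distribution over a small categorical data space so that it matches given marginal probabilities and is otherwise as uniform as possible. It is a brute-force illustration under stated assumptions, not the efficient inference algorithm proposed in the dissertation: the name `fit_maxent`, the constraint format, and the use of iterative proportional fitting over the fully enumerated state space are all choices made here for illustration, and enumeration is only practical for a handful of attributes.

```python
import itertools
import numpy as np

def fit_maxent(cardinalities, constraints, n_iter=200):
    """Fit a maximum entropy distribution by iterative proportional fitting.

    cardinalities: number of categories for each attribute.
    constraints:   list of (attribute_index, category, target_probability),
                   with every target strictly between 0 and 1.
    Returns the enumerated states and their probabilities.
    """
    states = list(itertools.product(*[range(k) for k in cardinalities]))
    # Start from the uniform distribution, the maxent model with no constraints.
    p = np.full(len(states), 1.0 / len(states))
    masks = [
        np.array([state[attr] == cat for state in states])
        for attr, cat, _ in constraints
    ]
    for _ in range(n_iter):
        for mask, (_, _, target) in zip(masks, constraints):
            current = p[mask].sum()
            # Rescale so that this constraint holds exactly; the total stays 1.
            p[mask] *= target / current
            p[~mask] *= (1.0 - target) / (1.0 - current)
    return states, p

# Two attributes with 2 and 3 categories; prior knowledge fixes two marginals.
states, p = fit_maxent((2, 3), [(0, 1, 0.7), (1, 2, 0.5)])
```

Because the two constraints here involve different attributes, the fitted model treats the attributes as independent: it encodes the supplied marginals and nothing more.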
Modeling data with multivariate count responses is a challenging problem due to
the discrete nature of the responses. Existing methods for univariate count
responses cannot be easily extended to the multivariate case since the
dependency among multiple responses needs to be properly accounted for. To model
multivariate data with multiple count responses, we propose a novel multivariate
Poisson log-normal model (MVPLN). By simultaneously estimating the regression
coefficients and inverse covariance matrix over the latent variables with an
efficient Monte Carlo EM algorithm, the proposed model takes advantage of
association among multiple count responses to improve the model prediction
accuracy. Simulation studies and applications to real-world data are conducted
to systematically evaluate the performance of the proposed method in comparison
with conventional methods.
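The generative side of a multivariate Poisson log-normal model can be sketched as below. This shows only how correlated counts arise from a latent Gaussian layer; it does not implement the Monte Carlo EM estimation or the inverse covariance estimation described above. The function name `simulate_mvpln` and all parameter values are illustrative assumptions.

```python
import numpy as np

def simulate_mvpln(X, B, Sigma, rng):
    """Draw count responses from a multivariate Poisson log-normal model.

    X:     (n, d) design matrix.
    B:     (d, k) regression coefficients, one column per response.
    Sigma: (k, k) covariance of the latent Gaussian noise.
    Each row of counts is Poisson with rate exp(X @ B + noise).
    """
    n, k = X.shape[0], B.shape[1]
    noise = rng.multivariate_normal(np.zeros(k), Sigma, size=n)
    latent = X @ B + noise
    return rng.poisson(np.exp(latent))

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
B = np.array([[1.0, 0.5],
              [0.3, -0.2]])
Sigma = np.array([[0.5, 0.4],
                  [0.4, 0.5]])  # strong positive latent association
Y = simulate_mvpln(X, B, Sigma, rng)
response_correlation = np.corrcoef(Y.T)[0, 1]
```

The off-diagonal entries of `Sigma` are what couple the responses; fitting separate univariate count models would ignore that coupling.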