Contributions to Data Reduction and Statistical Model of Data with Complex Structures


Virginia Tech


With advances in technology and the explosion of information, the data of interest often have complex structures, with large sample sizes and high dimensionality, and with features that may be continuous or discrete. There is an emerging need for data reduction, efficient modeling, and model inference. For example, data can contain millions of observations with thousands of features. Traditional methods, such as linear regression or LASSO regression, cannot effectively deal with such a large dataset directly. This dissertation aims to develop several techniques to effectively analyze large datasets with complex structures in observational, experimental, and time series data.

In Chapter 2, I focus on data reduction for model estimation in sparse regression. Commonly used subdata selection methods typically rely on either sampling or feature screening. For data with both a large number of observations and a large number of predictors, we propose a filtering approach for model estimation (FAME) that reduces both the number of data points and the number of features. The proposed algorithm can be easily extended to data with a discrete response or discrete predictors. In simulations and case studies, the proposed method performs well in parameter estimation while remaining computationally efficient.

In Chapter 3, I focus on modeling experimental data with a quantitative-sequence (QS) factor. Here the QS factor concerns both the quantities and the sequence orders of several components in the experiment. Existing methods can usually handle only the sequence orders or only the quantities of the components. To fill this gap, we propose a QS transformation that maps the QS factor to a generalized permutation matrix, and consequently develop a simple Gaussian process approach to model experimental data with QS factors.

In Chapter 4, I focus on forecasting multivariate time series data by leveraging autoregression and clustering. Existing time series forecasting methods treat each series independently and ignore their inherent correlation. To fill this gap, I propose a clustering approach based on autoregression, in which the sparsity of the estimated transition matrix is controlled by the adaptive lasso and the clustering coefficient. The clustering-based cross prediction can outperform conventional time series forecasting methods. Moreover, the clustering result can also enhance the forecasting accuracy of other forecasting methods. The proposed method can be applied to practical data, for example in stock forecasting and topic trend detection.
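To make the term "generalized permutation matrix" from Chapter 3 concrete, the following is a minimal sketch in Python. It assumes one plausible encoding of a QS factor, in which row i is the i-th position in the sequence, column c is component c, and the single nonzero entry in each row is that component's quantity. The function names `qs_to_matrix` and `is_generalized_permutation` are hypothetical, and the encoding is an illustration only; it is not necessarily the QS transformation defined in the dissertation.

```python
def qs_to_matrix(order, quantities):
    """Encode a QS factor as a generalized permutation matrix (list of rows).

    ASSUMED encoding, for illustration only:
      order      -- component indices in the order they are added, e.g. [2, 0, 1]
      quantities -- quantities[c] is the (nonzero) amount of component c
    Row i is the i-th position in the sequence; column c is component c.
    """
    k = len(order)
    if sorted(order) != list(range(k)) or len(quantities) != k:
        raise ValueError("order must be a permutation of 0..k-1 matching quantities")
    matrix = [[0.0] * k for _ in range(k)]
    for position, component in enumerate(order):
        matrix[position][component] = float(quantities[component])
    return matrix


def is_generalized_permutation(matrix):
    """Return True if every row and every column has exactly one nonzero entry."""
    k = len(matrix)
    rows_ok = all(sum(1 for v in row if v != 0) == 1 for row in matrix)
    cols_ok = all(sum(1 for row in matrix if row[c] != 0) == 1 for c in range(k))
    return rows_ok and cols_ok


# Component 2 is added first, then component 0, then component 1.
example = qs_to_matrix(order=[2, 0, 1], quantities=[1.5, 0.5, 3.0])
```

Under this assumed encoding, replacing every nonzero entry by 1 recovers an ordinary permutation matrix that records the sequence order alone, while the nonzero values carry the quantities, so a single matrix holds both parts of the QS factor.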



high-dimensional data, subdata selection, filtering approach, analysis of experimental data, Gaussian process, permutation matrix, QS factor, multivariate time series, spectral clustering, autoregression, cross prediction