Browsing by Author "Deng, Xinwei"
Now showing 1 - 20 of 63
- Advancements in Degradation Modeling, Uncertainty Quantification and Spatial Variable Selection. Xie, Yimeng (Virginia Tech, 2016-06-30). This dissertation focuses on three research projects: 1) construction of simultaneous prediction intervals/bounds for at least k out of m future observations; 2) a semi-parametric degradation model for accelerated destructive degradation test (ADDT) data; and 3) spatial variable selection with application to Lyme disease data in Virginia. Following the general introduction in Chapter 1, the rest of the dissertation consists of three main chapters. Chapter 2 presents the construction of two-sided simultaneous prediction intervals (SPIs) or one-sided simultaneous prediction bounds (SPBs) to contain at least k out of m future observations, based on complete or right-censored data from the (log)-location-scale family of distributions. SPIs/SPBs calculated by the proposed procedure have exact coverage probability for complete and Type II censored data. In the Type I censoring case, they have asymptotically correct coverage probability and give reasonably good results for small samples. The proposed procedures can be extended to multiply-censored or randomly censored data. Chapter 3 focuses on the analysis of ADDT data. We use a general degradation path model with a correlated covariance structure to describe ADDT data. Monotone B-splines are used to model the underlying degradation process. A likelihood-based iterative procedure for parameter estimation is developed. The confidence intervals of parameters are calculated using a nonparametric bootstrap procedure. Both simulated data and real datasets are used to compare the semi-parametric model with existing parametric models. Chapter 4 studies Lyme disease emergence in Virginia. The objective is to find important environmental and demographic covariates that are associated with Lyme disease emergence. To address the high-dimensional integral problem in the loglikelihood function, we consider the penalized quasi-loglikelihood and the approximated loglikelihood based on the Laplace approximation. We impose the adaptive elastic net penalty to obtain sparse estimation of parameters and thus achieve selection of important variables. The proposed methods are investigated in simulation studies. We also apply the proposed methods to Lyme disease data in Virginia. Finally, Chapter 5 contains general conclusions and discussions of future work.
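For the first project above, a minimal Monte Carlo sketch of calibrating a two-sided SPI factor for the complete-data normal case is shown below; it illustrates the "at least k out of m" coverage idea under assumed sample sizes, not the exact procedure developed in Chapter 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def sim_coverage(c, n, m, k, n_sim=20000):
    """Monte Carlo coverage of the interval xbar +/- c*s containing
    at least k of m future observations (standard normal case)."""
    x = rng.standard_normal((n_sim, n))
    xbar, s = x.mean(axis=1), x.std(axis=1, ddof=1)
    future = rng.standard_normal((n_sim, m))
    inside = (np.abs(future - xbar[:, None]) <= c * s[:, None]).sum(axis=1)
    return (inside >= k).mean()

def calibrate_c(n, m, k, alpha=0.05, lo=0.5, hi=10.0, tol=1e-3):
    """Bisection on c so that coverage is approximately 1 - alpha."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sim_coverage(mid, n, m, k) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c = calibrate_c(n=15, m=5, k=4)
print(f"calibrated factor c = {c:.3f}")  # SPI: xbar +/- c*s
```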
- Advancements on the Interface of Computer Experiments and Survival Analysis. Wang, Yueyao (Virginia Tech, 2022-07-20). Design and analysis of computer experiments is an area focusing on efficient data collection (e.g., space-filling designs), surrogate modeling (e.g., Gaussian process models), and uncertainty quantification. Survival analysis focuses on modeling the period of time until a certain event happens. Data collection, prediction, and uncertainty quantification are also fundamental in survival models. In this dissertation, the proposed methods are motivated by a wide range of real-world applications, including high-performance computing (HPC) variability data, jet engine reliability data, Titan GPU lifetime data, and pine tree survival data. This dissertation explores the interface between computer experiments and survival analysis through these applications. Chapter 1 provides a general introduction to computer experiments and survival analysis. Chapter 2 focuses on the HPC variability management application. We investigate the applicability of space-filling designs and statistical surrogates in the HPC variability management setting, in terms of design efficiency, prediction accuracy, and scalability. A comprehensive comparison of the design strategies and predictive methods is conducted to study the combinations' performance in prediction accuracy. Chapter 3 focuses on the reliability prediction application. With the availability of multi-channel sensor data, a single degradation index is needed for compatibility with most existing models. We propose a flexible framework for multi-sensor data to model the nonlinear relationship between sensors and the degradation process. We also incorporate automatic variable selection to exclude sensors that have no effect on the underlying degradation process. Chapter 4 investigates inference approaches for spatial survival analysis under the Bayesian framework. The performance of Markov chain Monte Carlo (MCMC) approaches and variational inference is studied for two survival models, the cumulative exposure model and the proportional hazards (PH) model. The Titan GPU data and pine tree survival data are used to illustrate the capability of variational inference on spatial survival models. Chapter 5 provides some general conclusions.
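As a small illustration of the Chapter 2 ingredients, the sketch below pairs a space-filling (Latin hypercube) design with a Gaussian process surrogate on a toy function; the simulator and kernel choice are assumptions, not the dissertation's HPC variability setup.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical simulator standing in for an expensive response surface.
def simulator(x):
    return np.sin(2 * np.pi * x[:, 0]) + x[:, 1] ** 2

# Space-filling (Latin hypercube) design in [0, 1]^2.
sampler = qmc.LatinHypercube(d=2, seed=1)
X_train = sampler.random(n=40)
y_train = simulator(X_train)

# Gaussian process surrogate with a Matern kernel.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_train, y_train)

# Predictions with uncertainty at held-out space-filling test points.
X_test = qmc.LatinHypercube(d=2, seed=2).random(n=200)
mean, sd = gp.predict(X_test, return_std=True)
print("max predictive sd:", sd.max())
```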
- Alternative approaches for creating a wealth index: the case of Mozambique. Xie, Kexin; Marathe, Achla; Deng, Xinwei; Ruiz-Castillo, Paula; Imputiua, Saimado; Elobolobo, Eldo; Mutepa, Victor; Sale, Mussa; Nicolas, Patricia; Montana, Julia; Jamisse, Edgar; Munguambe, Humberto; Materrula, Felisbela; Casellas, Aina; Rabinovich, Regina; Saute, Francisco; Chaccour, Carlos J.; Sacoor, Charfudin; Rist, Cassidy (BMJ, 2023-08). Introduction: The wealth index is widely used as a proxy for a household's socioeconomic position (SEP) and living standard. This work constructs a wealth index for the Mopeia district in Mozambique using data collected in 2021 under the BOHEMIA (Broad One Health Endectocide-based Malaria Intervention in Africa) project. Methods: We evaluate the performance of three alternative approaches against the Demographic and Health Survey (DHS) method-based wealth index: feature-selection principal component analysis (PCA), sparse PCA and robust PCA. The internal coherence among the four wealth indices is investigated through statistical testing. Validation and an evaluation of the stability of the wealth index are performed with additional household income data from the BOHEMIA Health Economics Survey and the 2018 Malaria Indicator Survey data in Mozambique. Results: The Spearman's rank correlation between wealth index ventiles from the four methods is over 0.98, indicating high consistency in results across methods. Wealth rankings and household income show a strong concordance, with an area under the curve of ∼0.7 in the receiver operating characteristic analysis. The agreement between the alternative wealth indices and the DHS wealth index demonstrates the stability of the rankings from the alternative methods. Conclusions: This study creates a wealth index for Mopeia, Mozambique, and shows that the DHS method-based wealth index is an appropriate proxy for SEP in low-income regions. However, this research recommends feature-selection PCA over the DHS method since it uses fewer asset indicators and constructs a high-quality wealth index.
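A minimal sketch of the DHS-style construction described above: the first principal component of standardized asset indicators serves as the wealth score, which is then cut into ranked groups. The asset variables here are hypothetical placeholders, not BOHEMIA survey items.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical binary household asset indicators.
assets = pd.DataFrame({
    "has_electricity": [0, 1, 1, 0, 1],
    "improved_water":  [0, 0, 1, 1, 1],
    "owns_radio":      [1, 1, 0, 0, 1],
    "owns_bicycle":    [0, 1, 1, 0, 0],
})

# First principal component of standardized asset indicators = wealth score.
Z = StandardScaler().fit_transform(assets)
score = PCA(n_components=1).fit_transform(Z).ravel()

# Rank households into quintiles (the paper uses ventiles; same idea with 20 bins).
assets["wealth_score"] = score
assets["wealth_quintile"] = pd.qcut(score, 5, labels=False, duplicates="drop") + 1
print(assets)
```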
- Analyzing Highway Safety Datasets: Simplifying Statistical Analyses from Sparse to Big Data. Lord, Dominique; Geedipally, Srinivas Reddy; Guo, Feng; Jahangiri, Arash; Shirazi, Mohammadali; Mao, Huiying; Deng, Xinwei (SAFE-D: Safety Through Disruption National University Transportation Center, 2019-07). Data used for safety analyses have characteristics that are not found in other disciplines. In this research, we examine three characteristics that can negatively influence the outcome of these safety analyses: (1) crash data with many zero observations; (2) the rare occurrence of crash events (not necessarily related to many zero observations); and (3) big datasets. These characteristics can lead to biased results if inappropriate analysis tools are used. The objectives of this study are to simplify the analysis of highway safety data and develop guidelines and analysis tools for handling these unique characteristics. The research provides guidelines on when to aggregate data over time and space to reduce the number of zero observations; uses heuristics for selecting statistical models; proposes a bias adjustment method for improving the estimation of risk factors; develops a decision-adjusted modeling framework for predicting risk; and shows how cluster analyses can be used to extract relevant information from big data. The guidelines and tools were developed using simulation and observed datasets. Examples are provided to illustrate the guidelines and tools.
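For the excess-zeros characteristic noted above, one common modeling option (an illustration here, not necessarily the report's recommended tool) is a zero-inflated Poisson regression; the sketch below assumes statsmodels' ZeroInflatedPoisson and a hypothetical traffic-volume covariate on simulated crash counts.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(0)

# Simulated crash counts with many structural zeros (hypothetical covariate: traffic volume).
n = 500
volume = rng.uniform(0, 2, n)
lam = np.exp(-1.0 + 0.8 * volume)      # Poisson mean for "at-risk" sites
at_risk = rng.random(n) < 0.6          # roughly 40% structural zeros
crashes = np.where(at_risk, rng.poisson(lam), 0)

# Zero-inflated Poisson: Poisson part with a covariate, constant inflation part.
X = sm.add_constant(volume)
zip_model = ZeroInflatedPoisson(crashes, X, exog_infl=np.ones((n, 1)), inflation="logit")
res = zip_model.fit(disp=False)
print(res.summary())
```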
- Assessing annual urban change and its impacts on evapotranspiration. Wan, Heng (Virginia Tech, 2020-06-19). Land Use Land Cover Change (LULCC) is a major component of global environmental change, which can have large impacts on biodiversity, water yield and quality, climate, soil condition, food security and human welfare. Of all the LULCC types, urbanization is considered to be the most impactful. Monitoring past and current urbanization processes can provide valuable information for ecosystem services evaluation and policy-making. The National Land Cover Database (NLCD) provides land use land cover data covering the entire United States, and it is widely used as the land use land cover input to numerous environmental models. One major drawback of NLCD is that it is updated every five years, which makes it unsatisfactory for models requiring land use land cover data with a higher temporal resolution. This dissertation integrated a rich time series of Landsat imagery and NLCD to achieve annual urban change mapping in the Washington D.C. metropolitan area by using time series change-point detection methods. Three different time series change-point detection methods were tested and compared to find the optimal one. One major limitation of using this change-point detection approach for annual urban mapping is that it relies heavily on NLCD, so the method is not applicable to near-real-time monitoring of urban change. To achieve near-real-time urban change identification, this research applied machine learning-based classification models, including random forests and Artificial Neural Networks (ANN), to automatically detect urban changes using a rich time series of Landsat imagery as inputs. Urban growth can result in a higher probability of flooding by reducing infiltration and evapotranspiration (ET). ET plays an important role in stormwater mitigation and flood reduction, so assessing the changes in ET under different urban growth scenarios can yield valuable information for urban planners and policy makers. In this study, spatially explicit annual ET data at 30-m resolution were generated for Virginia Beach by integrating daily ET data derived from the METRIC model and Landsat imagery. Annual ET rates across the major land cover types were compared, and the results indicated that converting forest to urban land can result in a large reduction in ET, thus increasing flood probability. Furthermore, we developed statistical models to explain spatial ET variation using high-resolution (1 m) land cover data. The results showed that annual ET increases with canopy cover and decreases with impervious cover and water table depth.
- Assessment of Penalized Regression for Genome-wide Association Studies. Yi, Hui (Virginia Tech, 2014-08-27). The data from genome-wide association studies (GWAS) in humans are still predominantly analyzed using single marker association methods. As an alternative to Single Marker Analysis (SMA), all or subsets of markers can be tested simultaneously. This approach requires a form of Penalized Regression (PR), as the number of SNPs is much larger than the sample size. Here we review PR methods in the context of GWAS, extend them to perform penalty parameter and SNP selection by False Discovery Rate (FDR) control, and assess their performance (including penalties incorporating linkage disequilibrium) in comparison with SMA. PR methods were compared with SMA on realistically simulated GWAS data, consisting of genotype data from single and multiple chromosomes and a continuous phenotype, and on real data. Based on our comparisons, our analytic FDR criterion may currently be the best approach to SNP selection using PR for GWAS. We found that PR with FDR control provides substantially more power than SMA with genome-wide type-I error control but somewhat less power than SMA with Benjamini-Hochberg FDR control. PR controlled the FDR conservatively, while SMA-BH may not achieve FDR control in all situations. Differences among PR methods seem quite small when the focus is on variable selection with FDR control. Incorporating LD into PR by adapting penalties developed for covariates measured on graphs can improve power but can also generate more false positives or wider regions for follow-up. We recommend using the Elastic Net with a mixing weight for the Lasso penalty near 0.5 as the best method.
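A minimal sketch of the recommendation above (elastic net with a lasso mixing weight of 0.5) on simulated genotypes; tuning here uses cross-validation rather than the analytic FDR criterion developed in the thesis, and the data are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(2)

# Simulated genotypes (0/1/2 minor-allele counts) with a few causal SNPs.
n, p = 300, 500
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
beta = np.zeros(p)
beta[:5] = 0.5
y = X @ beta + rng.standard_normal(n)

# Elastic net with mixing weight 0.5 for the lasso penalty, as recommended above.
enet = ElasticNetCV(l1_ratio=0.5, cv=5, n_alphas=50, max_iter=5000)
enet.fit(X, y)
selected = np.flatnonzero(enet.coef_)
print("selected SNPs:", selected[:20], "...", len(selected), "total")
```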
- Bayesian Inference Based on Nonparametric Regression for Highly Correlated and High Dimensional Data. Yun, Young Ho (Virginia Tech, 2024-12-13). Establishing relationships among observed variables is important in many research studies. However, the task becomes increasingly difficult in the presence of unidentified complexities stemming from interdependencies among multi-dimensional variables and variability across subjects. This dissertation presents three novel methodological approaches to address these complex associations between highly correlated and high-dimensional data. First, group multi-kernel machine regression (GMM) is proposed to identify the association between two sets of multidimensional functions, offering the flexibility to effectively capture complex associations among high-dimensional variables. Second, semiparametric kernel machine regression under a Bayesian hierarchical structure, denoted fused kernel machine regression (Fused-KMR), is introduced for matched case-crossover studies, enabling flexible modeling of multiple covariate effects within strata and their complex interactions. Last, the dissertation presents a Bayesian hierarchical framework designed to identify multiple change points in the relationship between ambient temperature and mortality rate. Unlike traditional methods, this framework treats change points as random variables, enabling the modeling of nonparametric functions that vary by region; it is denoted the multiple random change point (MRCP) approach. Simulation studies and real-world applications illustrate the effectiveness and advantages of these approaches in capturing intricate associations and enhancing predictive accuracy.
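As a loose, frequentist stand-in for the kernel machine regressions described above, the sketch below fits a Gaussian-kernel ridge regression to a simulated nonlinear response; the Bayesian hierarchical structure of GMM and Fused-KMR is not reproduced, and all data and tuning values are assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(3)

# Correlated multidimensional covariates and a nonlinear response with interactions.
n = 200
X = rng.standard_normal((n, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2] + 0.1 * rng.standard_normal(n)

# Gaussian (RBF) kernel machine fit; alpha and gamma are illustrative choices.
km = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.2)
km.fit(X, y)
print("in-sample R^2:", km.score(X, y))
```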
- Bayesian Modeling of Complex High-Dimensional Data. Huo, Shuning (Virginia Tech, 2020-12-07). With the rapid development of modern high-throughput technologies, scientists can now collect high-dimensional complex data in different forms, such as medical images and genomics measurements. However, acquisition of more data does not automatically lead to better knowledge discovery. One needs efficient and reliable analytical tools to extract useful information from complex datasets. The main objective of this dissertation is to develop innovative Bayesian methodologies to enable effective and efficient knowledge discovery from complex high-dimensional data. It contains two parts: the development of computationally efficient functional mixed models and the modeling of data heterogeneity via the Dirichlet Diffusion Tree. The first part focuses on tackling the computational bottleneck in Bayesian functional mixed models. We propose a computational framework called the variational functional mixed model (VFMM). This new method facilitates efficient data compression and high-performance computing in basis space. We also propose a new multiple testing procedure in basis space, which can be used to detect significant local regions. The effectiveness of the proposed model is demonstrated on two datasets, a mass spectrometry dataset from a cancer study and a neuroimaging dataset from an Alzheimer's disease study. The second part is about modeling data heterogeneity using Dirichlet Diffusion Trees. We propose a Bayesian latent tree model that incorporates covariates of subjects to characterize the heterogeneity and uncover the latent tree structure underlying the data. This innovative model may reveal the hierarchical evolution process through branch structures and estimate systematic differences between groups of samples. We demonstrate the effectiveness of the model through a simulation study and a brain tumor dataset.
- Bayesian Multilevel-multiclass Graphical Model. Lin, Jiali (Virginia Tech, 2019-06-21). The Gaussian graphical model has been a popular tool for investigating conditional dependency between random variables by estimating sparse precision matrices. Two problems are discussed. One is to learn multiple Gaussian graphical models at multiple levels from unknown classes. The other is to select Gaussian processes in semiparametric multi-kernel machine regression. The first problem is approached by the Gaussian graphical model. In this project, I consider learning multiple connected graphs among multilevel variables from unknown classes. I estimate the classes of the observations from the mixture distributions by evaluating the Bayes factor and learn the network structures by fitting a novel neighborhood selection algorithm. This approach is able to identify class membership and reveal network structures for multilevel variables simultaneously. Unlike most existing methods that solve this problem by frequentist approaches, I assess an alternative, a novel hierarchical Bayesian approach, to incorporate prior knowledge. The second problem focuses on the analysis of correlated high-dimensional data, which has been useful in many applications. In this work, I consider the problem of detecting signals with a semiparametric regression model that can study the effects of fixed covariates (e.g., clinical variables) and sets of elements (e.g., pathways of genes). I model the unknown high-dimensional functions of multiple sets via multi-Gaussian kernel machines to allow for the possibility that elements within the same set interact with each other. Hence, my variable selection can be considered Gaussian process selection. I develop my Gaussian process selection under the Bayesian variable selection framework.
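A frequentist neighborhood-selection sketch (Meinshausen-Bühlmann style) illustrating the basic graph-learning step referenced above: each node is regressed on the others with a lasso, and nonzero coefficients become edges. The Bayesian multilevel-multiclass machinery of the dissertation is not reproduced; the chain-graph data are simulated.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)

# Simulated data from a sparse chain graph (tridiagonal precision matrix).
p, n = 10, 400
prec = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
cov = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# Neighborhood selection: lasso-regress each node on all others; nonzero
# coefficients define graph edges (symmetrized with the "OR" rule).
adj = np.zeros((p, p), dtype=bool)
for j in range(p):
    others = np.delete(np.arange(p), j)
    fit = LassoCV(cv=5).fit(X[:, others], X[:, j])
    adj[j, others] = fit.coef_ != 0
adj = adj | adj.T
print(adj.astype(int))
```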
- Bivariate functional data clustering: grouping streams based on a varying coefficient model of the stream water and air temperature relationship. Li, H.; Deng, Xinwei; Dolloff, C. Andrew; Smith, Eric P. (2016-02). A novel clustering method for bivariate functional data is proposed to group streams based on their water-air temperature relationship. A distance measure is developed for bivariate curves by using a time-varying coefficient model and a weighting scheme. This distance is also adjusted for spatial correlation of streams via the variogram. Therefore, the proposed distance not only measures the difference among streams with respect to their water-air temperature relationship but also accounts for spatial correlation among the streams. The proposed clustering method is applied to 62 streams in the Southeast US that have paired air-water temperature measurements over a ten-month period. The results show that streams in the same cluster reflect common characteristics such as solar radiation, percent forest and elevation.
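Once a pairwise distance matrix between streams is available (however it is constructed), the clustering step can be carried out with standard hierarchical clustering, as in the sketch below; the random distance matrix here is a placeholder for the model-based, variogram-adjusted distances of the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(5)

# Hypothetical pairwise distance matrix between streams; in the paper this
# comes from a time-varying coefficient model of water vs. air temperature
# plus a variogram-based spatial adjustment.
n_streams = 8
D = rng.uniform(0.2, 1.0, size=(n_streams, n_streams))
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)

# Agglomerative clustering on the condensed distance matrix.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster labels:", labels)
```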
- BOHEMIA: a cluster randomized trial to assess the impact of an endectocide-based one health approach to malaria in Mozambique: baseline demographics and key malaria indicators. Ruiz-Castillo, Paula; Imputiua, Saimado; Xie, Kexin; Elobolobo, Eldo; Nicolas, Patricia; Montaña, Julia; Jamisse, Edgar; Munguambe, Humberto; Materrula, Felisbela; Casellas, Aina; Deng, Xinwei; Marathe, Achla; Rabinovich, Regina; Saute, Francisco; Chaccour, Carlos; Sacoor, Charfudin (2023-06-04). Background: Many geographical areas of sub-Saharan Africa, especially in rural settings, lack complete and up-to-date demographic data, posing a challenge for the implementation and evaluation of public health interventions and for carrying out large-scale health research. A demographic survey was completed in Mopeia district, located in the Zambezia province of Mozambique, to inform the Broad One Health Endectocide-based Malaria Intervention in Africa (BOHEMIA) cluster randomized clinical trial, which tested ivermectin mass drug administration to humans and/or livestock as a potential novel strategy to decrease malaria transmission. Methods: The demographic survey was a prospective descriptive study, which collected data on all households in the district that agreed to participate. Households were mapped through geolocation and identified with a unique identification number. Basic demographic data on the household members were collected, and each person received a permanent identification number for the study. Results: 25,550 households were mapped and underwent the demographic survey, and 131,818 individuals were registered in the district. The average household size was 5 members, and 76.9% of households identified a male household head. Housing conditions are often substandard, with low access to improved water systems and electricity. The reported coverage of malaria interventions was 71.1% for indoor residual spraying and 54.1% for universal coverage of long-lasting insecticidal nets. The median age of the population was 15 years. There were 910 deaths reported in the previous 12 months, and 43.9% were of children less than 5 years of age. Conclusions: The study showed that the district had good coverage of vector control tools against malaria but sub-optimal living conditions and poor access to basic services. The majority of households are led by males, and Mopeia Sede/Cuacua is the most populated locality in the district. The population of Mopeia is young (< 15 years), and childhood mortality is high. The results of this survey were crucial, as they provided the household and population profiles and allowed the design and implementation of the cluster randomized clinical trial. Trial registration: NCT04966702.
- Bridging Machine Learning and Experimental Design for Enhanced Data Analysis and Optimization. Guo, Qing (Virginia Tech, 2024-07-19). Experimental design is a powerful tool for gathering highly informative observations using a small number of experiments. The demand for smart data collection strategies is increasing due to the need to save time and budget, especially in online experiments and machine learning. However, traditional experimental design methods fall short in systematically assessing the effects of changing variables. Specifically within Artificial Intelligence (AI), the challenge lies in assessing the impacts of model structures and training strategies on task performance with a limited number of trials. This shortfall underscores the necessity of developing novel approaches. On the other hand, the optimal design criterion has typically been model-based in the classic design literature, which restricts the flexibility of experimental design strategies. However, machine learning's inherent flexibility can empower the efficient estimation of metrics using nonparametric and optimization techniques, thereby broadening the horizons of experimental design possibilities. In this dissertation, the aim is to develop a set of novel methods to bridge the merits of these two domains: 1) applying ideas from statistical experimental design to enhance data efficiency in machine learning, and 2) leveraging powerful deep neural networks to optimize experimental design strategies. This dissertation consists of 5 chapters. Chapter 1 provides a general introduction to mutual information, fractional factorial design, hyper-parameter tuning, multi-modality, etc. In Chapter 2, I propose a new mutual information estimator, FLO, by integrating techniques from variational inference (VAE), contrastive learning, and convex optimization. I apply FLO to broad data science applications, such as efficient data collection, transfer learning, and fair learning. Chapter 3 introduces a new design strategy called multi-layer sliced design (MLSD) with an application to AI assurance. It focuses on exploring the effects of hyper-parameters under different models and optimization strategies. Chapter 4 investigates classic vision challenges via multimodal large language models by implicitly optimizing mutual information and thoroughly exploring training strategies. Chapter 5 concludes this dissertation and discusses several future research topics.
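The FLO estimator itself is variational and contrastive; as a simple point of reference only, the sketch below computes a nonparametric (k-nearest-neighbor) mutual information estimate with scikit-learn on a Gaussian pair whose true MI is known in closed form.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(6)

# Correlated Gaussian pair with known MI = -0.5 * log(1 - rho^2).
rho, n = 0.8, 5000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

# kNN-based nonparametric MI estimate (not the variational FLO estimator).
mi_hat = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
print(f"kNN estimate: {mi_hat:.3f}, truth: {-0.5 * np.log(1 - rho**2):.3f}")
```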
- Bridging the Gap: Selected Problems in Model Specification, Estimation, and Optimal Design from Reliability and Lifetime Data Analysis. King, Caleb B. (Virginia Tech, 2015-04-13). Understanding the lifetime behavior of their products is crucial to the success of any company in the manufacturing and engineering industries. Statistical methods for lifetime data are a key component to achieving this level of understanding. Sometimes a statistical procedure must be updated to be adequate for modeling specific data, as is discussed in Chapter 2. However, there are cases in which the methods used in industrial standards are themselves inadequate. This is distressing, as more appropriate statistical methods are available but remain unused. The research in Chapter 4 deals with such a situation. The research in Chapter 3 serves as a combination of both scenarios and represents how statisticians and engineers from industry can join together to yield beautiful results. After introducing basic concepts and notation in Chapter 1, Chapter 2 focuses on lifetime prediction for a product consisting of multiple components. During the production period, some components may be upgraded or replaced, resulting in a new "generation" of component. Incorporating this information into a competing risks model can greatly improve the accuracy of lifetime prediction. A generalized competing risks model is proposed and simulation is used to assess its performance. In Chapter 3, optimal and compromise test plans are proposed for constant amplitude fatigue testing. These test plans are based on a nonlinear physical model from the fatigue literature that is able to better capture the nonlinear behavior of fatigue life and account for effects from the testing environment. Sensitivity to the design parameters and modeling assumptions is investigated, and suggestions for planning strategies are proposed. Chapter 4 considers the analysis of ADDT data for the purpose of estimating a thermal index. The current industry standards use a two-step procedure involving least squares regression in each step. The methodology preferred in the statistical literature is the maximum likelihood procedure. A comparison of the procedures is performed, and two published datasets are used as motivating examples. The maximum likelihood procedure is presented as a more viable alternative to the two-step procedure due to its ability to quantify uncertainty in data inference and its modeling flexibility.
- The CCP Selector: Scalable Algorithms for Sparse Ridge Regression from Chance-Constrained Programming. Xie, Weijun; Deng, Xinwei (2018-06-11). Sparse regression and variable selection for large-scale data have been rapidly developed in the past decades. This work focuses on sparse ridge regression, which considers the exact $L_0$ norm to pursue sparsity. We lay out a theoretical foundation to understand why many existing approaches may not work well for this problem, in particular on large-scale datasets. Inspired by reformulating the problem as a chance-constrained program, we derive a novel mixed integer second order conic (MISOC) reformulation and prove that its continuous relaxation is equivalent to that of the convex integer formulation proposed in a recent work. Based upon these two formulations, we develop two new scalable algorithms, the greedy and randomized algorithms, for sparse ridge regression with desirable theoretical properties. The proposed algorithms are proved to yield near-optimal solutions under mild conditions. In the case of much larger dimensions, we propose to integrate the greedy algorithm with the randomized algorithm, which can greedily search the features from the nonzero subset identified by the continuous relaxation of the MISOC formulation. The merits of the proposed methods are elaborated through a set of numerical examples in comparison with several existing ones.
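To make the sparse ridge objective concrete, here is a plain greedy forward-selection sketch for the $L_0$-constrained ridge problem; it illustrates the kind of greedy search mentioned above, not the MISOC-guided algorithm of the paper.

```python
import numpy as np

def greedy_sparse_ridge(X, y, k, lam=1.0):
    """Greedy forward selection for min ||y - Xb||^2 + lam*||b||^2 s.t. ||b||_0 <= k.
    A plain forward-selection sketch, not the paper's MISOC-guided algorithm."""
    n, p = X.shape
    support = []
    for _ in range(k):
        best_j, best_obj = None, np.inf
        for j in range(p):
            if j in support:
                continue
            S = support + [j]
            XS = X[:, S]
            b = np.linalg.solve(XS.T @ XS + lam * np.eye(len(S)), XS.T @ y)
            obj = np.sum((y - XS @ b) ** 2) + lam * np.sum(b ** 2)
            if obj < best_obj:
                best_j, best_obj = j, obj
        support.append(best_j)
    S = sorted(support)
    XS = X[:, S]
    beta = np.zeros(p)
    beta[S] = np.linalg.solve(XS.T @ XS + lam * np.eye(len(S)), XS.T @ y)
    return beta

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 30))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.standard_normal(100)
print(np.flatnonzero(greedy_sparse_ridge(X, y, k=3)))  # should recover features 0, 1, 2
```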
- Change Detection and Analysis of Data with Heterogeneous Structures. Chu, Shuyu (Virginia Tech, 2017-07-28). Heterogeneous data with different characteristics are ubiquitous in the modern digital world. For example, the observations collected from a process may change in mean or variance. In numerous applications, data are often of mixed types, including both discrete and continuous variables. Heterogeneity also commonly arises when underlying models vary across different segments of the data. Besides, the underlying pattern of the data may change in different dimensions, such as time and space. The diversity of heterogeneous data structures makes statistical modeling and analysis challenging. Detection of change points in heterogeneous data has attracted great attention from a variety of application areas, such as quality control in manufacturing, protest event detection in social science, purchase likelihood prediction in business analytics, and organ state change in biomedical engineering. However, due to the extraordinary diversity of heterogeneous data structures and the complexity of the underlying dynamic patterns, change detection and analysis of such data is quite challenging. This dissertation aims to develop novel statistical modeling methodologies to analyze four types of heterogeneous data and to find change points efficiently. The proposed approaches have been applied to solve real-world problems and can potentially be applied to a broad range of areas.
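For orientation, the sketch below shows the simplest version of the task described above: locating a single mean shift in a homogeneous Gaussian series with a standardized CUSUM statistic. It is a textbook illustration, not one of the dissertation's four methodologies.

```python
import numpy as np

def cusum_changepoint(x):
    """Locate a single mean shift by maximizing the standardized CUSUM statistic."""
    n = len(x)
    s = np.cumsum(x - x.mean())        # CUSUM of deviations from the overall mean
    k = np.arange(1, n)
    scale = np.sqrt(k * (n - k) / n)   # standardization so statistics are comparable
    stat = np.abs(s[:-1]) / (scale * x.std(ddof=1))
    return int(np.argmax(stat)) + 1, float(stat.max())

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(0, 1, 120), rng.normal(1.2, 1, 80)])  # shift at t = 120
tau, stat = cusum_changepoint(x)
print(f"estimated change point: {tau}, statistic: {stat:.2f}")
```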
- Computational Framework for Uncertainty Quantification, Sensitivity Analysis and Experimental Design of Network-based Computer Simulation Models. Wu, Sichao (Virginia Tech, 2017-08-29). When capturing a real-world, networked system using a simulation model, features are usually omitted or represented by probability distributions. Verification and validation (V and V) of such models is an inherent and fundamental challenge. Central to V and V, but also to model analysis and prediction, are uncertainty quantification (UQ), sensitivity analysis (SA) and design of experiments (DOE). In addition, network-based computer simulation models, as compared with models based on ordinary and partial differential equations (ODE and PDE), typically involve a significantly larger volume of more complex data. Efficient use of such models is challenging since it requires a broad set of skills ranging from domain expertise to in-depth knowledge including modeling, programming, algorithmics, high-performance computing, statistical analysis, and optimization. On top of this, the need to support reproducible experiments necessitates complete data tracking and management. Finally, the lack of standardization of simulation model configuration formats presents an extra challenge when developing technology intended to work across models. While there are tools and frameworks that address parts of the challenges above, to the best of our knowledge, none of them accomplishes all this in a model-independent and scientifically reproducible manner. In this dissertation, we present a computational framework called GENEUS that addresses these challenges. Specifically, it incorporates (i) a standardized model configuration format, (ii) a data flow management system with digital library functions helping to ensure scientific reproducibility, and (iii) a model-independent, expandable plugin-type library for efficiently conducting UQ/SA/DOE for network-based simulation models. This framework has been applied to systems ranging from fundamental graph dynamical systems (GDSs) to large-scale socio-technical simulation models with a broad range of analyses such as UQ and parameter studies for various scenarios. Graph dynamical systems provide a theoretical framework for network-based simulation models and have been studied theoretically in this dissertation. This includes a broad range of stability and sensitivity analyses offering insights into how GDSs respond to perturbations of their key components. This stability-focused, structure-to-function theory was a motivator for the design and implementation of GENEUS. GENEUS, rooted in the framework of GDS, provides modelers, experimentalists, and research groups access to a variety of UQ/SA/DOE methods with robust and tested implementations without requiring them to necessarily have the detailed expertise in statistics, data management and computing. Even for research teams having all the skills, GENEUS can significantly increase research productivity.
- Computer Experimental Design for Gaussian Process Surrogates. Zhang, Boya (Virginia Tech, 2020-09-01). With the rapid development of computing power, computer experiments have gained popularity in various scientific fields, like cosmology, ecology and engineering. However, some computer experiments for complex processes are still computationally demanding. A surrogate model, or emulator, is often employed as a fast substitute for the simulator. Meanwhile, a common challenge in computer experiments and related fields is to efficiently explore the input space using a small number of samples, i.e., the experimental design problem. This dissertation focuses on the design problem under Gaussian process surrogates. The first work demonstrates empirically that space-filling designs disappoint when the model hyperparameterization is unknown and must be estimated from data observed at the chosen design sites. A purely random design is shown to be superior to higher-powered alternatives in many cases. Thereafter, a new family of distance-based designs is proposed and their superior performance is illustrated in both static (one-shot design) and sequential settings. The second contribution is motivated by an agent-based model (ABM) of delta smelt conservation. The ABM is developed to assist in a study of delta smelt life cycles and to understand sensitivities to myriad natural variables and human interventions. However, the input space is high-dimensional, running the simulator is time-consuming, and its outputs change nonlinearly in both mean and variance. A batch sequential design scheme is proposed, generalizing one-at-a-time variance-based active learning, as a means of keeping multi-core cluster nodes fully engaged with expensive runs. The acquisition strategy is carefully engineered to favor the selection of replicates, which boosts statistical and computational efficiencies. Design performance is illustrated on a range of toy examples before embarking on a smelt simulation campaign and downstream high-fidelity input sensitivity analysis.
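The batch scheme above generalizes one-at-a-time variance-based active learning; the sketch below shows that simpler one-at-a-time version (acquisition at the point of highest predictive variance) on a toy simulator, under assumed kernel, candidate-set, and budget choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(9)

def simulator(x):                      # hypothetical expensive simulator
    return np.sin(6 * x[:, 0]) * x[:, 1]

# Small random initial design, then one-at-a-time variance-based acquisition;
# the dissertation's batch/replication-aware scheme is not reproduced here.
X = rng.random((8, 2))
y = simulator(X)
candidates = rng.random((500, 2))

for _ in range(12):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    _, sd = gp.predict(candidates, return_std=True)
    x_new = candidates[np.argmax(sd)][None, :]   # highest predictive variance
    X = np.vstack([X, x_new])
    y = np.append(y, simulator(x_new))

print("final design size:", len(X))
```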
- Contributions to Data Reduction and Statistical Model of Data with Complex Structures. Wei, Yanran (Virginia Tech, 2022-08-30). With advanced technology and the information explosion, the data of interest often have complex structures, with large size and dimension in the form of continuous or discrete features. There is an emerging need for data reduction, efficient modeling, and model inference. For example, data can contain millions of observations with thousands of features. Traditional methods, such as linear regression or LASSO regression, cannot effectively deal with such a large dataset directly. This dissertation aims to develop several techniques to effectively analyze large datasets with complex structures in observational, experimental and time series data. In Chapter 2, I focus on data reduction for model estimation in sparse regression. Commonly used subdata selection methods often consider sampling or feature screening. For data with both a large number of observations and a large number of predictors, we propose a filtering approach for model estimation (FAME) to reduce both the number of data points and the number of features. The proposed algorithm can be easily extended to data with a discrete response or discrete predictors. Through simulations and case studies, the proposed method provides good performance for parameter estimation with efficient computation. In Chapter 3, I focus on modeling experimental data with a quantitative-sequence (QS) factor. Here the QS factor concerns both the quantities and the sequence orders of several components in the experiment. Existing methods usually focus only on the sequence orders or the quantities of the multiple components. To fill this gap, we propose a QS transformation to transform the QS factor into a generalized permutation matrix, and consequently develop a simple Gaussian process approach to model experimental data with QS factors. In Chapter 4, I focus on forecasting multivariate time series data by leveraging autoregression and clustering. Existing time series forecasting methods treat each series independently and ignore their inherent correlation. To fill this gap, I propose a clustering method based on autoregression and control the sparsity of the transition matrix estimate via the adaptive lasso and clustering coefficients. The clustering-based cross prediction can outperform conventional time series forecasting methods. Moreover, the clustering result can also enhance the forecasting accuracy of other forecasting methods. The proposed method can be applied to practical problems, such as stock forecasting and topic trend detection.
- Contributions to Efficient Statistical Modeling of Complex Data with Temporal Structures. Hu, Zhihao (Virginia Tech, 2022-03-03). This dissertation focuses on three research projects: neighborhood vector autoregression in multivariate time series, uncertainty quantification for agent-based modeling of networked anagram games, and a scalable algorithm for multi-class classification. The first project studies the modeling of multivariate time series, with applications in the environmental sciences and other areas. In this work, a so-called neighborhood vector autoregression (NVAR) model is proposed to efficiently analyze large-dimensional multivariate time series. The time series are assumed to have underlying distances among them based on the inherent setting of the problem. When this distance matrix is available or can be obtained, the proposed NVAR method is demonstrated to provide a computationally efficient and theoretically sound estimation of model parameters. The performance of the proposed method is compared with other existing approaches in both simulation studies and a real application to a stream nitrogen study. The second project focuses on the study of group anagram games. In a group anagram game, players are provided letters to form as many words as possible. In this work, enhanced agent behavior models for networked group anagram games are built, exercised, and evaluated under an uncertainty quantification framework. Specifically, the game data for players are clustered based on their skill levels (forming words, requesting letters, and replying to requests), multinomial logistic regressions for transition probabilities are performed, and the uncertainty is quantified within each cluster. The result of this process is a model where players are assigned different numbers of neighbors and different skill levels in the game. Simulations of ego agents with neighbors are conducted to demonstrate the efficacy of the proposed methods. The third project aims to develop efficient and scalable algorithms for multi-class classification, which achieve a balance between prediction accuracy and computing efficiency, especially in high-dimensional settings. Traditional multinomial logistic regression becomes slow in high-dimensional settings where the number of classes (M) and the number of features (p) are large. Our algorithms are computationally efficient and scale to data with even higher dimensions. The simulation and case study results demonstrate that our algorithms have a substantial advantage over traditional multinomial logistic regression and maintain comparable prediction performance.
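A simplified numerical take on the NVAR idea above: with a known distance matrix, each series is regressed only on its own lag and the lags of nearby series. The distance threshold, coefficient values, and VAR(1) structure here are assumptions for illustration, not the dissertation's specification.

```python
import numpy as np

rng = np.random.default_rng(10)

# Simulate a VAR(1) whose transition matrix is nonzero only for nearby series.
d, T = 12, 400
coords = rng.random((d, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
A_true = np.where(dist < 0.25, 0.1, 0.0)
np.fill_diagonal(A_true, 0.4)
A_true *= 0.9 / max(np.abs(np.linalg.eigvals(A_true)).max(), 0.9)  # keep the VAR stable
X = np.zeros((T, d))
for t in range(1, T):
    X[t] = X[t - 1] @ A_true.T + 0.1 * rng.standard_normal(d)

# Neighborhood-restricted estimation: each series is regressed only on its own
# lag and the lags of series within the distance threshold.
A_hat = np.zeros((d, d))
Y, Z = X[1:], X[:-1]
for i in range(d):
    nbrs = np.flatnonzero((dist[i] < 0.25) | (np.arange(d) == i))
    coef, *_ = np.linalg.lstsq(Z[:, nbrs], Y[:, i], rcond=None)
    A_hat[i, nbrs] = coef
print("max abs estimation error:", np.abs(A_hat - A_true).max())
```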
- Contributions to Large Covariance and Inverse Covariance Matrices Estimation. Kang, Xiaoning (Virginia Tech, 2016-08-25). Estimation of the covariance matrix and its inverse is of great importance in multivariate statistics, with broad applications such as dimension reduction, portfolio optimization, linear discriminant analysis and gene expression analysis. However, accurate estimation of covariance or inverse covariance matrices is challenging due to the positive definiteness constraint and the large number of parameters, especially in high-dimensional cases. In this thesis, I develop several approaches for estimating large covariance and inverse covariance matrices with different applications. In Chapter 2, I consider the estimation of time-varying covariance matrices in the analysis of multivariate financial data. An order-invariant Cholesky-log-GARCH model is developed for estimating the time-varying covariance matrices based on the modified Cholesky decomposition. This decomposition provides a statistically interpretable parametrization of the covariance matrix. The key idea of the proposed model is to consider an ensemble estimate of the covariance matrix based on multiple permutations of the variables. Chapter 3 investigates sparse estimation of the inverse covariance matrix for high-dimensional data. This problem has attracted wide attention, since zero entries in the inverse covariance matrix imply conditional independence among variables. I propose an order-invariant sparse estimator based on the modified Cholesky decomposition. The proposed estimator is obtained by assembling a set of estimates from multiple permutations of the variables. Hard thresholding is imposed on the ensemble Cholesky factor to encourage sparsity in the estimated inverse covariance matrix. The proposed method is able to recover the correct sparse structure of the inverse covariance matrix. Chapter 4 focuses on sparse estimation of a large covariance matrix. The traditional estimation approach is known to perform poorly in high dimensions. I propose a positive-definite estimator for the covariance matrix using the modified Cholesky decomposition. Such a decomposition provides the flexibility to obtain a set of covariance matrix estimates. The proposed method considers an ensemble estimator as the "center" of these available estimates with respect to the Frobenius norm. The proposed estimator is not only guaranteed to be positive definite, but also able to capture the underlying sparse structure of the true matrix.
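Since all three chapters build on the modified Cholesky decomposition, a short numerical sketch of that decomposition is given below: sequential regressions of each variable on its predecessors yield a unit lower-triangular factor T and a diagonal D with inv(S) = T' inv(D) T. The ensemble-over-permutations and thresholding steps of the thesis are not shown.

```python
import numpy as np

def modified_cholesky(S):
    """Modified Cholesky decomposition: inv(S) = T' inv(D) T, with T unit
    lower-triangular (negated regression coefficients) and D diagonal,
    obtained by regressing each variable on its predecessors."""
    p = S.shape[0]
    T = np.eye(p)
    D = np.zeros(p)
    D[0] = S[0, 0]
    for j in range(1, p):
        phi = np.linalg.solve(S[:j, :j], S[:j, j])   # regression coefficients
        T[j, :j] = -phi
        D[j] = S[j, j] - S[:j, j] @ phi              # residual variance
    return T, np.diag(D)

rng = np.random.default_rng(11)
X = rng.standard_normal((500, 4))
S = np.cov(X, rowvar=False)
T, D = modified_cholesky(S)
# Check the identity inv(S) = T' inv(D) T.
print(np.allclose(np.linalg.inv(S), T.T @ np.linalg.inv(D) @ T))
```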