Browsing by Author "Higdon, David"
- Advanced Nonparametric Bayesian Functional Modeling. Gao, Wenyu (Virginia Tech, 2020-09-04). Functional analyses have gained more interest as we gain easier access to massive data sets. However, such data sets often contain substantial heterogeneity, noise, and high dimensionality. When generalizing analyses from vectors to functions, classical methods might not work directly. This dissertation considers noisy information reduction in functional analyses from two perspectives: functional variable selection to reduce the dimensionality, and functional clustering to group similar observations and thus reduce the sample size. Complicated data structures and relations can be easily modeled by a Bayesian hierarchical model, or developed from a more generic one by changing the prior distributions. Hence, this dissertation focuses on the development of Bayesian approaches for functional analyses due to their flexibility. A nonparametric Bayesian approach, such as the Dirichlet process mixture (DPM) model, places a nonparametric distribution as the prior. This approach provides flexibility and reduces assumptions, especially for functional clustering, because the DPM model has an automatic clustering property, so the number of clusters does not need to be specified in advance. Furthermore, a weighted Dirichlet process mixture (WDPM) model allows for more heterogeneity in the data by assuming more than one unknown prior distribution. It also gathers more information from the data by introducing a weight function that assigns different candidate priors, such that less similar observations are more separated. Thus, the WDPM model improves the clustering and model estimation results. In this dissertation, we used advanced nonparametric Bayesian approaches to study functional variable selection and functional clustering methods. We proposed 1) a stochastic search functional selection method, with application to 1-M matched case-crossover studies of aseptic meningitis, to examine the time-varying unknown relationship and identify important covariates affecting disease contraction; 2) a functional clustering method via the WDPM model, with application to three pathways related to genetic diabetes data, to identify essential genes distinguishing between normal and disease groups; and 3) a combined functional clustering (with the WDPM model) and variable selection approach, with application to high-frequency spectral data, to select wavelengths associated with breast cancer racial disparities.
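To make the DPM model's automatic-clustering property concrete, below is a minimal sketch of one collapsed Gibbs sweep over cluster assignments under a Chinese restaurant process prior, assuming a one-dimensional normal-normal conjugate setup with known within-cluster variance. The function name, priors, and toy data are illustrative assumptions, not the dissertation's functional-clustering implementation (which builds on the weighted DPM).

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_gibbs_sweep(y, z, alpha=1.0, sigma2=1.0, tau2=4.0):
    """One collapsed Gibbs sweep for a 1-D normal DPM with known variance.

    y      : data vector
    z      : current integer cluster labels (modified in place and returned)
    alpha  : DP concentration; larger alpha favors more clusters
    sigma2 : within-cluster variance (assumed known here)
    tau2   : variance of the mean-zero prior on each cluster mean
    """
    n = len(y)
    for i in range(n):
        mask = np.arange(n) != i
        z_others = z[mask]
        labels, counts = np.unique(z_others, return_counts=True)
        log_p = []
        for lab, cnt in zip(labels, counts):
            yk = y[mask][z_others == lab]
            # posterior predictive of y[i] given the other members of cluster lab
            post_var = 1.0 / (1.0 / tau2 + len(yk) / sigma2)
            post_mean = post_var * yk.sum() / sigma2
            pred_var = post_var + sigma2
            log_p.append(np.log(cnt) - 0.5 * (y[i] - post_mean) ** 2 / pred_var
                         - 0.5 * np.log(2 * np.pi * pred_var))
        # probability of opening a brand-new cluster
        pred_var = tau2 + sigma2
        log_p.append(np.log(alpha) - 0.5 * y[i] ** 2 / pred_var
                     - 0.5 * np.log(2 * np.pi * pred_var))
        log_p = np.array(log_p)
        p = np.exp(log_p - log_p.max())
        choice = rng.choice(len(p), p=p / p.sum())
        z[i] = labels[choice] if choice < len(labels) else z.max() + 1
    return z

# toy data from two well-separated groups; labels start all in one cluster
y = np.concatenate([rng.normal(-3, 1, 30), rng.normal(3, 1, 30)])
z = np.zeros(len(y), dtype=int)
for _ in range(20):
    z = crp_gibbs_sweep(y, z)
print("clusters found:", len(np.unique(z)))
```

Because the number of occupied clusters is resampled along with the assignments, the sampler never needs the number of clusters specified in advance, which is the property the abstract highlights.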
- Bayesian Uncertainty Quantification while Leveraging Multiple Computer Model Runs. Walsh, Stephen A. (Virginia Tech, 2023-06-22). In the face of spatially correlated data, Gaussian process regression is a very common modeling approach. Given observational data, the kriging equations provide the best linear unbiased predictor of the mean at unobserved locations. However, when a computer model provides a complete grid of forecasted values, kriging does not apply. To quantify uncertainty of computer model output in this setting, we leverage information from a collection of computer model runs (e.g., historical forecast and observation pairs for tropical cyclone precipitation totals) through a Bayesian hierarchical framework. This framework allows us to combine information and account for the spatial correlation within and across computer model output. Maximum likelihood estimates and the corresponding Hessian matrices for the Gaussian process parameters are input to a Gibbs sampler, which provides posterior distributions for parameters of interest. These samples are used to generate predictions that provide uncertainty quantification for a given computer model run (e.g., a tropical cyclone precipitation forecast). We then extend this framework using deep Gaussian processes to allow for nonstationary covariance structure, applied to multiple computer model runs from a cosmology application. We also perform sensitivity analyses to understand which parameter inputs most greatly impact cosmological computer model output.
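For reference, the kriging equations mentioned above amount to the Gaussian process posterior mean and variance at unobserved locations. The following is a minimal sketch with a squared-exponential kernel and fixed, hand-picked hyperparameters; the hierarchical Gibbs sampler and the cyclone application from the dissertation are not reproduced.

```python
import numpy as np

def sq_exp_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of inputs (rows)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def krige(X, y, Xnew, lengthscale=1.0, variance=1.0, nugget=1e-6):
    """GP posterior mean (the best linear unbiased predictor) and variance."""
    K = sq_exp_kernel(X, X, lengthscale, variance) + nugget * np.eye(len(X))
    Ks = sq_exp_kernel(Xnew, X, lengthscale, variance)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = variance - (v ** 2).sum(axis=0)
    return mean, var

# toy example: noisy sine observed at 8 locations, predicted on a fine grid
rng = np.random.default_rng(1)
X = rng.uniform(0, 2 * np.pi, size=(8, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.05, 8)
Xnew = np.linspace(0, 2 * np.pi, 100)[:, None]
mean, var = krige(X, y, Xnew)
print(np.round(mean[:5], 3), np.round(var[:5], 3))
```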
- Computer Experimental Design for Gaussian Process Surrogates. Zhang, Boya (Virginia Tech, 2020-09-01). With the rapid development of computing power, computer experiments have gained popularity in various scientific fields, such as cosmology, ecology, and engineering. However, some computer experiments for complex processes are still computationally demanding. A surrogate model, or emulator, is often employed as a fast substitute for the simulator. Meanwhile, a common challenge in computer experiments and related fields is to efficiently explore the input space using a small number of samples, i.e., the experimental design problem. This dissertation focuses on the design problem under Gaussian process surrogates. The first work demonstrates empirically that space-filling designs disappoint when the model hyperparameterization is unknown and must be estimated from data observed at the chosen design sites. A purely random design is shown to be superior to higher-powered alternatives in many cases. Thereafter, a new family of distance-based designs is proposed and its superior performance is illustrated in both static (one-shot design) and sequential settings. The second contribution is motivated by an agent-based model (ABM) of delta smelt conservation. The ABM is developed to assist in a study of delta smelt life cycles and to understand sensitivities to myriad natural variables and human interventions. However, the input space is high-dimensional, running the simulator is time-consuming, and its outputs change nonlinearly in both mean and variance. A batch sequential design scheme is proposed, generalizing one-at-a-time variance-based active learning, as a means of keeping multi-core cluster nodes fully engaged with expensive runs. The acquisition strategy is carefully engineered to favor selection of replicates which boost statistical and computational efficiencies. Design performance is illustrated on a range of toy examples before embarking on a smelt simulation campaign and downstream high-fidelity input sensitivity analysis.
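As a point of reference for the distance-based designs discussed above, here is a minimal sketch of a greedy maximin design: each new point is chosen from a random candidate set to maximize its distance to the points already selected. This is a generic illustration of the distance-based idea, not the specific family of designs proposed in the dissertation.

```python
import numpy as np

def greedy_maximin(n, dim, n_candidates=2000, seed=0):
    """Greedily build an n-point design in [0,1]^dim that pushes up the
    minimum pairwise distance, choosing each point from random candidates."""
    rng = np.random.default_rng(seed)
    cand = rng.uniform(size=(n_candidates, dim))
    design = [cand[0]]
    for _ in range(n - 1):
        # distance from every candidate to its nearest already-chosen point
        d = np.min(
            np.linalg.norm(cand[:, None, :] - np.array(design)[None, :, :], axis=-1),
            axis=1,
        )
        design.append(cand[np.argmax(d)])
    return np.array(design)

X = greedy_maximin(20, 2)
print(X.shape)  # (20, 2)
```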
- Data-driven Methods in Mechanical Model Calibration and Prediction for Mesostructured Materials. Kim, Jee Yun (Virginia Tech, 2018-10-01). Mesoscale design involving control of the material distribution pattern can create a statistically heterogeneous material system, which has shown increased adaptability to complex mechanical environments involving highly non-uniform stress fields. Advances in multi-material additive manufacturing can aid in this mesoscale design, providing voxel-level control of material properties. This vast freedom in the design space also unlocks possibilities for optimization of the material distribution pattern. The optimization problem can be divided into a forward problem, focused on accurate prediction, and an inverse problem, focused on efficient search of the optimal design. In the forward problem, the physical behavior of the material can be modeled based on fundamental mechanics laws and simulated through finite element analysis (FEA). A major limitation in modeling is the unknown parameters in the constitutive equations that describe the constituent materials; determining these parameters via conventional single-material testing has proven insufficient, which necessitates novel and effective approaches to calibration. A calibration framework based on Bayesian inference, which integrates data from simulations and physical experiments, has been applied to a study involving a mesostructured material fabricated by fused deposition modeling. Calibration results provide insights on what values these parameters converge to, as well as which material parameters the model output depends on most, while accounting for sources of uncertainty introduced during the modeling process. Additionally, this statistical formulation is able to provide quick predictions of the physical system by implementing a surrogate and a discrepancy model. The surrogate model is a statistical representation of the simulation results, circumventing issues arising from computational load, while the discrepancy model accounts for the difference between the simulation output and physical experiments. In this thesis, this Bayesian calibration framework is applied to a material bending problem, where in-situ mechanical characterization data and FEA simulations based on constitutive modeling are combined to produce updated values of the unknown material parameters with uncertainty.
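The calibration idea described above can be illustrated, in heavily simplified form, by a random-walk Metropolis sampler for an unknown simulator parameter, with field observations modeled as simulator output plus a constant discrepancy and Gaussian noise. The simulator stand-in, priors, and noise level below are illustrative assumptions; the thesis's FEA model, surrogate, and discrepancy model are far richer.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulator(x, theta):
    """Cheap stand-in for an expensive FEA run (illustrative only)."""
    return np.sin(x * theta)

# synthetic "field" data generated with true theta = 1.7 and a small bias
x_field = np.linspace(0, 3, 15)
y_field = simulator(x_field, 1.7) + 0.1 + rng.normal(0, 0.05, x_field.size)

def log_post(theta, delta, sigma=0.05):
    """Gaussian likelihood for field data = simulator + constant discrepancy,
    a flat prior on theta in (0, 5), and an N(0, 1) prior on delta."""
    if not (0.0 < theta < 5.0):
        return -np.inf
    resid = y_field - simulator(x_field, theta) - delta
    return -0.5 * np.sum(resid ** 2) / sigma ** 2 - 0.5 * delta ** 2

theta, delta = 1.0, 0.0
samples = []
for _ in range(5000):
    th_prop, de_prop = theta + rng.normal(0, 0.05), delta + rng.normal(0, 0.02)
    if np.log(rng.uniform()) < log_post(th_prop, de_prop) - log_post(theta, delta):
        theta, delta = th_prop, de_prop
    samples.append((theta, delta))
samples = np.array(samples)[1000:]            # drop burn-in
print("posterior means (theta, delta):", np.round(samples.mean(axis=0), 3))
```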
- Deep Gaussian Process Surrogates for Computer Experiments. Sauer, Annie Elizabeth (Virginia Tech, 2023-04-27). Deep Gaussian processes (DGPs) upgrade ordinary GPs through functional composition, in which intermediate GP layers warp the original inputs, providing flexibility to model non-stationary dynamics. Recent applications in machine learning favor approximate, optimization-based inference for fast predictions, but applications to computer surrogate modeling, with an eye towards downstream tasks like Bayesian optimization and reliability analysis, demand broader uncertainty quantification (UQ). I prioritize UQ through full posterior integration in a Bayesian scheme, hinging on elliptical slice sampling of latent layers. I demonstrate how my DGP's non-stationary flexibility, combined with appropriate UQ, allows for active learning: a virtuous cycle of data acquisition and model updating that departs from traditional space-filling designs and yields more accurate surrogates for fixed simulation effort. I propose new sequential design schemes that rely on optimization of acquisition criteria through evaluation of strategically allocated candidates instead of numerical optimizations, with a motivating application to contour location in an aeronautics simulation. Alternatively, when simulation runs are cheap and readily available, large datasets present a challenge for full DGP posterior integration due to cubic scaling bottlenecks. For this case I introduce the Vecchia approximation, popular for ordinary GPs in spatial data settings. I show that Vecchia-induced sparsity of Cholesky factors allows for linear computational scaling without compromising DGP accuracy or UQ. I vet both active learning and Vecchia-approximated DGPs on numerous illustrative examples and real computer experiments. I provide open-source implementations in the "deepgp" package for R on CRAN.
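Elliptical slice sampling, on which the posterior integration above hinges, is a compact algorithm (Murray, Adams, and MacKay, 2010). Below is a minimal sketch of one update of a latent Gaussian vector under an arbitrary log-likelihood; the full DGP and Vecchia machinery live in the author's "deepgp" R package and are not reproduced here.

```python
import numpy as np

def elliptical_slice(f, prior_chol, log_lik, rng):
    """One elliptical slice sampling update of a latent vector f ~ N(0, Sigma),
    where prior_chol is the Cholesky factor of Sigma."""
    nu = prior_chol @ rng.normal(size=f.shape)        # auxiliary draw from the prior
    log_y = log_lik(f) + np.log(rng.uniform())        # slice level
    theta = rng.uniform(0, 2 * np.pi)                 # initial angle on the ellipse
    lo, hi = theta - 2 * np.pi, theta                 # bracket to shrink toward 0
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_y:
            return f_new
        if theta < 0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)

# toy use: latent f with an N(0, K) prior and Gaussian observations y = f + noise
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 30)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1 ** 2) + 1e-8 * np.eye(30)
L = np.linalg.cholesky(K)
y = np.sin(6 * x) + rng.normal(0, 0.1, 30)
log_lik = lambda f: -0.5 * np.sum((y - f) ** 2) / 0.1 ** 2
f = np.zeros(30)
for _ in range(200):
    f = elliptical_slice(f, L, log_lik, rng)
print(np.round(f[:5], 2))
```

The update is rejection-free and has no tuning parameters, which is what makes it attractive for repeatedly sampling the latent layers of a DGP.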
- Efficient computer experiment designs for Gaussian process surrogates. Cole, David Austin (Virginia Tech, 2021-06-28). Due to advancements in supercomputing and algorithms for finite element analysis, today's computer simulation models often contain complex calculations that can result in a wealth of knowledge. Gaussian processes (GPs) are highly desirable models for computer experiments for their predictive accuracy and uncertainty quantification. This dissertation addresses GP modeling when data abound, as well as GP adaptive design when simulator expense severely limits the amount of collected data. For data-rich problems, I introduce a localized sparse covariance GP that preserves the flexibility and predictive accuracy of a GP's predictive surface while saving computational time. This locally induced Gaussian process (LIGP) incorporates latent design points (inducing points) into a local Gaussian process built from a subset of the data. Various methods are introduced for the design of the inducing points. LIGP is then extended to adapt to stochastic data with replicates, estimating noise while relying upon the unique design locations for computation. I also address the goal of identifying a contour when data collection resources are limited, through entropy-based adaptive design. Unlike existing methods, the entropy-based contour locator (ECL) adaptive design promotes exploration in the design space, performing well in higher dimensions and when the contour corresponds to a high/low quantile. ECL adaptive design can also be joined with importance sampling for the purpose of reducing uncertainty in reliability estimation.
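To give a flavor of entropy-based contour location, the sketch below scores candidate inputs by the entropy of the event that the response exceeds the target threshold under a GP's predictive distribution, and picks the most uncertain candidate. The predictive means and standard deviations are synthetic stand-ins, and this is a generic criterion rather than the dissertation's ECL design.

```python
import numpy as np
from scipy.stats import norm

def contour_entropy(mu, sd, threshold):
    """Entropy of the event {f(x) > threshold} under the GP predictive
    distribution; largest where the surrogate is most unsure of the contour side."""
    p = norm.sf(threshold, loc=mu, scale=sd)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# toy candidates: predictive means/sds that might come from any fitted GP
x_cand = np.linspace(0, 1, 200)
mu = np.sin(2 * np.pi * x_cand)          # stand-in predictive mean
sd = 0.1 + 0.3 * x_cand                  # stand-in predictive standard deviation
scores = contour_entropy(mu, sd, threshold=0.5)
print("next run at x =", round(x_cand[np.argmax(scores)], 3))
```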
- Extensions of Weighted Multidimensional Scaling with Statistics for Data Visualization and Process Monitoring. Kodali, Lata (Virginia Tech, 2020-09-04). This dissertation is the compilation of two major innovations that rely on a common technique known as multidimensional scaling (MDS). MDS is a dimension-reduction method that takes high-dimensional data and creates low-dimensional versions. Project 1: Visualizations are useful when learning from high-dimensional data. However, visualizations, just as any data summary, can be misleading when they do not incorporate measures of uncertainty, e.g., uncertainty from the data or from the dimension-reduction algorithm used to create the visual display. We incorporate uncertainty into visualizations created by a weighted version of MDS called WMDS. Uncertainty exists in these visualizations in the variable weights, the coordinates of the display, and the fit of WMDS. We quantify these uncertainties using Bayesian models in a method we call Informative Probabilistic WMDS (IP-WMDS). Visually, we display estimated uncertainty in the form of color and ellipses, and practically, these uncertainties reflect trust in WMDS. Our results show that these displays of uncertainty highlight different aspects of the visualization, which can help inform analysts. Project 2: Analysis of network data has emerged as an active research area in statistics. Much of the focus of ongoing research has been on static networks that represent a single snapshot or aggregated historical data unchanging over time. However, most networks result from temporally-evolving systems that exhibit intrinsic dynamic behavior. Monitoring such temporally-varying networks to detect anomalous changes has applications in both the social and physical sciences. In this work, we simulate data from models that rely on MDS, and we perform an evaluation study of the use of summary statistics for anomaly detection by incorporating principles from statistical process monitoring. In contrast to most previous studies, we deliberately incorporate temporal auto-correlation in our study. Other considerations in our comprehensive assessment include types and duration of anomaly, model type, and sparsity in temporally-evolving networks. We conclude that summary statistics can be valuable tools for network monitoring and often perform better than more involved techniques.
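For orientation, classical (unweighted) MDS can be written in a few lines: double-center the squared distance matrix and take the leading eigenvectors. WMDS adds per-variable weights on top of this, and the Bayesian IP-WMDS uncertainty layer is not shown; the sketch below is purely illustrative.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed points in k dimensions from a matrix of pairwise distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    eigval, eigvec = np.linalg.eigh(B)
    idx = np.argsort(eigval)[::-1][:k]           # top-k eigenpairs
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))

# toy example: recover a 2-D configuration from its own distance matrix
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Z = classical_mds(D, k=2)
print(Z.shape)  # (30, 2), unique only up to rotation/reflection
```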
- Gradient-Based Sensitivity Analysis with Kernels. Wycoff, Nathan Benjamin (Virginia Tech, 2021-08-20). Emulation of computer experiments via surrogate models can be difficult when the number of input parameters determining the simulation grows beyond a few dozen. In this dissertation, we explore dimension reduction in the context of computer experiments. The active subspace method is a linear dimension-reduction technique which uses the gradients of a function to determine important input directions. Unfortunately, we cannot expect to always have access to the gradients of our black-box functions. We thus begin by developing an estimator for the active subspace of a function using kernel methods to indirectly estimate the gradient. We then demonstrate how to deploy the learned input directions to improve the predictive performance of local regression models by "undoing" the active subspace. Finally, we develop notions of sensitivity which are local to certain parts of the input space, which we then use to develop a Bayesian optimization algorithm that can exploit locally important directions.
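The active subspace construction referenced above, in its basic gradient-based form, averages outer products of gradients and keeps the leading eigenvectors. The sketch below uses analytic gradients of a toy function; the dissertation's contribution, estimating these gradients indirectly with kernel methods when they are unavailable, is not reproduced.

```python
import numpy as np

def active_subspace(grads, k=1):
    """Leading k eigenpairs of C = E[grad f grad f^T], estimated by averaging
    gradient outer products over sampled inputs (rows of grads)."""
    C = grads.T @ grads / grads.shape[0]
    eigval, eigvec = np.linalg.eigh(C)
    order = np.argsort(eigval)[::-1]
    return eigval[order[:k]], eigvec[:, order[:k]]

# toy function f(x) = sin(a . x): every gradient is parallel to a,
# so the one-dimensional active subspace should recover a's direction
rng = np.random.default_rng(6)
a = np.array([1.0, 0.5, 0.0, 0.0])
X = rng.uniform(-1, 1, size=(500, 4))
grads = np.cos(X @ a)[:, None] * a[None, :]
vals, vecs = active_subspace(grads, k=1)
print(np.round(vecs[:, 0] / np.linalg.norm(vecs[:, 0]), 3))
```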
- Inference for Populations: Uncertainty Propagation via Bayesian Population Synthesis. Grubb, Christopher Thomas (Virginia Tech, 2023-08-16). In this dissertation, we develop a new type of prior distribution, specifically for populations themselves, which we denote the Dirichlet Spacing prior. This prior solves a specific problem that arises when attempting to create synthetic populations from a known subset: the unfortunate reality that assuming independence between population members means that every synthetic population will be essentially the same. This is a problem because any model which yields only one result (or several very similar results) when we have very incomplete information is fundamentally flawed. We motivate our need for this new class of priors using agent-based models, though this prior could be used in any situation requiring synthetic populations.
- Multiscale and Dirichlet Methods for Supply Chain Order Simulation. Sabin, Robert Paul Travers (Virginia Tech, 2019-04-23). Supply chains are complex systems. Researchers in the Social and Decision Analytics Laboratory (SDAL) at Virginia Tech worked with a major global supply chain company to simulate an end-to-end supply chain. The supply chain data include raw materials, production lines, inventory, customer orders, and shipments. Including contributions of this author, Pires, Sabin, Higdon et al. (2017) developed simulations for the production, customer orders, and shipments. Customer orders are at the center of understanding behavior in a supply chain. This dissertation continues the supply chain simulation work by improving the order simulation. Orders come from a diverse set of customers with different habits. These habits can differ in which products they order, how often they order, how spaced out those order times are, and how much of each of those products is ordered. This dissertation is unique in that it relies extensively on Dirichlet and multiscale methods to tackle supply-chain order simulation. Multiscale model methodology is extended to include Dirichlet models, which are used to simulate order times for each customer and the collective system on different scales.
- North American Tree Bat (Genera: Lasiurus, Lasionycteris) Migration on the Mid-Atlantic Coast—Implications and Discussion for Current and Future Offshore Wind Development. True, Michael C. (Virginia Tech, 2022-01-18). In eastern North America, "tree bats" (genera Lasiurus and Lasionycteris) are highly susceptible to collisions with wind-energy turbines and are known to fly offshore during migration. This raises concern about the ongoing expansion of offshore wind-energy development off the Atlantic Coast. Season, atmospheric conditions, and site-level characteristics such as local habitat features (e.g., forest coverage) have been shown to influence wind turbine collision rates for bats onshore, and similar features may be related to risk offshore. In response to rapidly developing offshore wind energy, I assessed the factors affecting coastal and offshore presence of tree bats. I continuously gathered nightly tree bat occurrence data using stationary acoustic recorders on five structures (four lighthouses on barrier islands and one light tower offshore) off the coast of Virginia, USA, across all seasons, 2012–2019. I used generalized additive models to describe nightly tree bat occurrence in relation to multiple factors. I found that sites exhibited either maternity or migratory patterns in their seasonal occurrence, associated with local roosting resources (i.e., presence of forest). Across all sites, nightly occurrence was negatively related to wind speed and positively related to temperature and visibility. Using predictive performance metrics, I concluded that the model was highly predictive for the Virginia coast. My findings were consistent with other studies: tree bat occurrence probability, and presumed mortality risk from collisions with offshore wind turbines, is highest on nights with low wind speed and high temperature and visibility during spring and fall. The high predictive model performance I observed provides a basis on which managers, using a similar monitoring and modeling regime, could develop an effective curtailment-based mitigation strategy. Although information at fixed points is helpful for managing specific sites, large questions remain about certain aspects of tree bat migration, in part because direct evidence (i.e., tracking of individuals) has been difficult to obtain so far. For instance, patterns in fall behavior such as the timing of migration events, the existence of migratory pathways, consistencies in the direction of travel, the drivers of over-water flight, and the activity states of residents (or bats in stopover) remain unstudied in the mid-Atlantic. The recently established Motus Wildlife Tracking System, an array of ground-based receiver stations, provides a new technique to track individual bats via the ability to detect coarse-scale movement paths of attached very-high-frequency radio tags. To reveal patterns in migration, and to understand drivers of over-water flight, I captured and radio-tagged 115 eastern red bats (Lasiurus borealis) and subsequently tracked their movements. For the bats with evidence of large movements, most traveled in a southwesterly direction, with paths often oriented toward the interior of the continental landmass rather than along the coastline. This observation challenges earlier held beliefs that bats closely follow linear landscape features, such as the coast, when migrating.
I documented bats traveling across wide sections of the Chesapeake and Delaware bays, confirming the species' ability to travel across large water bodies. This behavior typically occurred in the early hours of the night and during favorable flying conditions such as low wind speeds, warm temperatures, and/or sudden increases in temperature associated with the passage of cold fronts. For bats engaging in site residency through the fall, the proportion of night-hours in which bats were in a resting state (and possibly torpor) increased with colder temperatures and the progression of the fall season. My study demonstrated that bats may be at risk of offshore wind turbine collisions off the mid-Atlantic, but that this risk might be minimal if most bats migrate toward the interior landscape rather than following the coast. Nonetheless, if flight over large water bodies such as the Chesapeake and Delaware bays is a viable proxy for over-ocean flight, then collision risk at offshore wind turbines may be linked to atmospheric conditions, seasonal timing, or other effects, and is therefore somewhat predictable and manageable with mitigation options such as smart curtailment.
- On the 3 M's of Epidemic Forecasting: Methods, Measures, and Metrics. Tabataba, Farzaneh Sadat (Virginia Tech, 2017-12-06). Over the past few decades, various computational and mathematical methodologies have been proposed for forecasting seasonal epidemics. In recent years, the deadly effects of enormous pandemics such as the H1N1 influenza virus, Ebola, and Zika have compelled scientists to find new ways to improve the reliability and accuracy of epidemic forecasts. The improvement and variety of these prediction methods are undeniable. Nevertheless, many challenges remain unresolved on the path to forecasting outbreaks using surveillance data. Obtaining clean real-time data has always been an obstacle. Moreover, surveillance data are usually noisy, and handling the uncertainty of the observed data is a major issue for forecasting algorithms. Choosing correct modeling assumptions regarding the nature of the infectious disease is another dilemma. Oversimplified models can lead to inaccurate forecasts, whereas more complicated methods require additional computational resources and information; without those, the model may not be able to converge to a unique optimum solution. Over the last decade, there has been a significant effort towards achieving better epidemic forecasting algorithms. However, the lack of standard, well-defined evaluation metrics impedes a fair judgment of the proposed methods. This dissertation is divided into two parts. In the first part, we present a Bayesian particle filter calibration framework integrated with an agent-based model to forecast the epidemic trend of diseases like flu and Ebola. Our approach uses Bayesian statistics to estimate the underlying disease model parameters given the observed data and to handle the uncertainty in the reasoning. An individual-based model with different intervention strategies can result in a large number of unknown parameters that must be properly calibrated. As particle filters can collapse in very large-scale systems (the curse-of-dimensionality problem), achieving the optimum solution becomes more challenging. Our proposed particle filter framework utilizes machine learning concepts to restrain the intractable search space. It incorporates a smart analyzer in the state dynamics unit that examines the predicted and observed data using machine learning techniques to guide the direction and amount of perturbation of each parameter in the searching process. The second part of this dissertation focuses on providing standard evaluation measures for evaluating epidemic forecasts. We present an end-to-end framework that introduces epidemiologically relevant features (Epi-features), error measures, and ranking schema as the main modules of the evaluation process. Lastly, we provide the evaluation framework as a software package named Epi-Evaluator and demonstrate the potential and capabilities of the framework by applying it to the output of different forecasting methods.
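As a simplified illustration of particle filter calibration for epidemic forecasting, the sketch below runs a plain bootstrap particle filter that tracks the transmission rate of a toy stochastic SIR model from noisy daily case counts. The observation model, jitter step, and SIR dynamics are illustrative assumptions; the dissertation's agent-based model and machine-learning-guided perturbation scheme are not shown.

```python
import numpy as np

rng = np.random.default_rng(7)

def sir_step(S, I, beta, gamma=0.2, N=1000):
    """One stochastic discrete-time SIR transition; returns (S, I, new infections)."""
    new_inf = rng.binomial(S, 1 - np.exp(-beta * I / N))
    new_rec = rng.binomial(I, gamma)
    return S - new_inf, I + new_inf - new_rec, new_inf

# simulate "observed" daily new infections with true beta = 0.5
S, I = 990, 10
obs = []
for _ in range(40):
    S, I, y = sir_step(S, I, beta=0.5)
    obs.append(y)

# bootstrap particle filter over (S, I, beta), one static beta per particle
P = 2000
S_p = np.full(P, 990); I_p = np.full(P, 10)
beta_p = rng.uniform(0.1, 1.0, P)
for y_t in obs:
    new_inf = rng.binomial(S_p, 1 - np.exp(-beta_p * I_p / 1000))
    new_rec = rng.binomial(I_p, 0.2)
    S_p, I_p = S_p - new_inf, I_p + new_inf - new_rec
    w = np.exp(-0.5 * (new_inf - y_t) ** 2 / 25.0)          # crude Gaussian observation model
    w = w / w.sum() if w.sum() > 0 else np.full(P, 1.0 / P)
    idx = rng.choice(P, size=P, p=w)                         # multinomial resampling
    S_p, I_p, beta_p = S_p[idx], I_p[idx], beta_p[idx]
    beta_p = np.clip(beta_p + rng.normal(0, 0.01, P), 1e-3, None)  # jitter against degeneracy
print("posterior mean beta:", round(beta_p.mean(), 3))
```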
- A Pedagogical Approach to Create and Assess Domain-Specific Data Science Learning Materials in the Biomedical Sciences. Chen, Daniel (Virginia Tech, 2022-02-01). This dissertation explores creating a set of domain-specific learning materials for the biomedical sciences to meet the educational gap in biomedical informatics, while also meeting the call for statisticians to advocate for process improvements in other disciplines. Data science educational materials are now plentiful enough to have become a commodity. This provides the opportunity to create domain-specific learning materials that better motivate learning using real-world examples while also capturing the intricacies of working with data in a specific domain. This dissertation shows how persona methodologies can be combined with a backwards design approach to create domain-specific learning materials. The work is divided into three major steps: (1) create and validate a learner self-assessment survey that can identify learner personas by clustering; (2) combine the information from persona methodology with a backwards design approach, using formative and summative assessments, to curate, plan, and assess domain-specific data science workshop materials for short-term and long-term efficacy; and (3) pilot and identify how to manage real-time feedback within a data coding teaching session to drive better learner motivation and engagement. The key findings from this dissertation suggest that using a structured framework to plan and curate learning materials is an effective way to identify key concepts in data science. However, just creating and teaching learning materials is not enough for long-term retention of knowledge. More effort toward long-term lesson maintenance and long-term strategies for practice will help retain the concepts learned from live instruction. Finally, it is essential that we are careful and purposeful in our content creation so as not to overwhelm learners, and that we integrate their needs into the materials as a primary focus. Overall, this contributes to the growing need for data science education in the biomedical sciences to train future clinicians to use and work with data and improve patient outcomes.
- Precision Aggregated Local Models. Edwards, Adam Michael (Virginia Tech, 2021-01-28). Large-scale Gaussian process (GP) regression is infeasible for large data sets due to the cubic scaling of flops and quadratic storage involved in working with covariance matrices. Remedies in recent literature focus on divide-and-conquer, e.g., partitioning into sub-problems and inducing functional (and thus computational) independence. Such approximations can be speedy, accurate, and sometimes even more flexible than an ordinary GP. However, a big downside is loss of continuity at partition boundaries. Modern methods like local approximate GPs (LAGPs) imply effectively infinite partitioning and are thus pathologically good and bad in this regard. Model averaging, an alternative to divide-and-conquer, can maintain absolute continuity but often over-smooths, diminishing accuracy. Here I propose putting LAGP-like methods into a local-experts-like framework, blending partition-based speed with model-averaging continuity, as a flagship example of what I call precision aggregated local models (PALM). Using N_C LAGPs, each selecting n from N data pairs, I illustrate a scheme that is at most cubic in n, quadratic in N_C, and linear in N, drastically reducing computational and storage demands. Extensive empirical illustration shows that PALM is at least as accurate as LAGP, can be much faster, and furnishes continuous predictive surfaces. Finally, I propose a sequential updating scheme that greedily refines a PALM predictor up to a computational budget, and several variations on the basic PALM that may provide predictive improvements.
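The aggregation idea at the heart of PALM can be previewed with inverse-variance (precision) weighting of several local predictions at a common location, as sketched below under the simplifying assumption that the local models are independent; PALM's local-expert construction and sequential refinement are not reproduced.

```python
import numpy as np

def precision_aggregate(means, variances):
    """Combine predictions from several local models at the same location,
    weighting each mean by its precision (1 / predictive variance)."""
    variances = np.asarray(variances, dtype=float)
    w = 1.0 / variances
    w = w / w.sum()
    agg_mean = np.sum(w * np.asarray(means, dtype=float))
    agg_var = 1.0 / np.sum(1.0 / variances)   # precision of the combination (independence assumed)
    return agg_mean, agg_var

# three local models predicting the same point with different confidence
print(precision_aggregate(means=[1.9, 2.1, 3.0], variances=[0.1, 0.2, 2.0]))
```

Confident local models dominate the combination, while the very uncertain third model contributes little, which is the intuition behind precision-based aggregation.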
- Predictive Model Fusion: A Modular Approach to Big, Unstructured Data. Hoegh, Andrew B. (Virginia Tech, 2016-05-05). Data sets of increasing size and complexity require new approaches for prediction, as the sheer volume of data from disparate sources inhibits joint processing and modeling. Rather, modular segmentation is required, in which a set of models processes (potentially overlapping) partitions of the data to independently construct predictions. This framework enables individual models to be tailored for specific selective superiorities without concern for existing models, which provides utility in cases of segmented expertise. However, a method for fusing predictions from the collection of models is required, as the models may be correlated. This work details optimal principles for fusing binary predictions from a collection of models to issue a joint prediction. An efficient algorithm is introduced and compared with off-the-shelf methods for binary prediction. This framework is then implemented in an applied setting to predict instances of civil unrest in Central and South America. Finally, model fusion principles of a spatiotemporal nature are developed to predict civil unrest. A novel multiscale modeling approach is used for efficient, scalable computation when combining a set of spatiotemporal predictions.
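As a naive baseline for the fusion problem described above, the sketch below pools binary-event probabilities from several models on the log-odds scale, which is only valid when the models are conditionally independent; the dissertation's optimal fusion principles explicitly handle correlated models and are not shown here.

```python
import numpy as np

def log_odds_pool(probs, prior=0.5):
    """Fuse binary-event probabilities from several models by summing their
    log-odds contributions relative to a common prior (naive-Bayes-style,
    valid only if the models are conditionally independent)."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    logit = lambda p: np.log(p / (1 - p))
    fused_logit = logit(prior) + np.sum(logit(probs) - logit(prior))
    return 1.0 / (1.0 + np.exp(-fused_logit))

# three models give mildly different probabilities of an unrest event
print(round(log_odds_pool([0.6, 0.7, 0.55], prior=0.2), 3))
```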
- Robust Bayesian Anomaly Detection Methods for Large Scale Sensor Systems. Merkes, Sierra Nicole (Virginia Tech, 2022-09-12). Sensor systems, such as modern wind tunnels, require continual monitoring to validate their quality, as corrupted data will increase both experimental downtime and budget and lead to inconclusive scientific and engineering results. One approach to validating sensor quality is monitoring the distribution of individual sensor measurements. However, in general settings, we do not know how correct measurements should be distributed for each sensor system. Instead of monitoring sensors individually, our approach relies on monitoring the co-variation of the entire network of sensor measurements, both within and across sensor systems. That is, by monitoring how sensors behave relative to each other, we can detect anomalies expeditiously. Previous monitoring methodologies, such as those based on Principal Component Analysis, can be heavily influenced by extremely outlying sensor anomalies. We propose two Bayesian mixture model approaches that utilize heavy-tailed Cauchy assumptions. First, we propose a Robust Bayesian Regression, which utilizes a scale-mixture model to induce a Cauchy regression. Second, we extend elements of the Robust Bayesian Regression methodology using additive mixture models that decompose the anomalous and non-anomalous sensor readings into two parametric compartments. Specifically, we use a non-local, heavy-tailed Cauchy component for isolating the anomalous sensor readings, which we refer to as the Modified Cauchy Net.
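A non-Bayesian analogue of the robust regression idea is sketched below: fitting a linear model by maximizing a Cauchy likelihood, so that extreme outliers exert bounded influence. The scale-mixture Gibbs sampler and the Modified Cauchy Net from the dissertation are not reproduced; the data and optimizer choice are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def cauchy_nll(params, X, y):
    """Negative log-likelihood of a linear model with Cauchy-distributed errors."""
    beta, log_scale = params[:-1], params[-1]
    resid = y - X @ beta
    scale = np.exp(log_scale)
    return np.sum(np.log(np.pi * scale * (1 + (resid / scale) ** 2)))

rng = np.random.default_rng(8)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.3, n)
y[:5] += 50                                   # a handful of gross sensor anomalies
fit = minimize(cauchy_nll, x0=np.zeros(3), args=(X, y), method="Nelder-Mead")
print("robust estimates:", np.round(fit.x[:2], 2))   # close to (1, 2) despite outliers
```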
- Sequential learning, large-scale calibration, and uncertainty quantification. Huang, Jiangeng (Virginia Tech, 2019-07-23). With remarkable advances in computing power, computer experiments continue to expand the boundaries and drive down the cost of various scientific discoveries. New challenges keep arising from designing, analyzing, modeling, calibrating, optimizing, and predicting in computer experiments. This dissertation consists of six chapters, exploring statistical methodologies in sequential learning, model calibration, and uncertainty quantification for heteroskedastic and large-scale computer experiments. For heteroskedastic computer experiments, an optimal lookahead-based sequential learning strategy is presented, balancing replication and exploration to facilitate separating signal from input-dependent noise. Motivated by challenges in both large data size and model fidelity arising from ever-larger modern computer experiments, highly accurate and computationally efficient divide-and-conquer calibration methods, based on on-site experimental design and surrogate modeling for large-scale computer models, are developed in this dissertation. The proposed methodology is applied to calibrate a real computer experiment from the gas and oil industry. This on-site surrogate calibration method is further extended to multiple-output calibration problems.
- Some Advances in Local Approximate Gaussian Processes. Sun, Furong (Virginia Tech, 2019-10-03). Nowadays, the Gaussian process (GP) is recognized as an indispensable statistical tool in computer experiments. Due to its computational complexity and storage demands, its application in real-world problems, especially in "big data" settings, is quite limited. Among many strategies to tailor GPs to such settings, Gramacy and Apley (2015) proposed the local approximate GP (laGP), which builds approximate predictive equations from small local designs constructed around each predictive location under a chosen criterion. In this dissertation, several methodological extensions based upon laGP are proposed. One methodological contribution is multilevel global/local modeling, which deploys global hyper-parameter estimates to perform local prediction. The second contribution extends the laGP notion of "locale" to a set of predictive locations, along paths in the input space. These two contributions have been applied to satellite drag emulation, which is illustrated in Chapter 3. Furthermore, the multilevel GP modeling strategy has also been applied to synthesize field data and computer model outputs of solar irradiance across the continental United States, combined with inverse-variance weighting, which is detailed in Chapter 4. Last but not least, in Chapter 5, laGP's performance is tested on emulating daytime land surface temperatures estimated via satellites, in the setting of irregular grid locations.
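The simplest version of the laGP "local design" idea is sketched below: predict at a location using a GP fit only to its nearest neighbors, with fixed hyperparameters. Gramacy and Apley's sequential criterion-based design and the dissertation's multilevel global/local extensions are not shown.

```python
import numpy as np

def local_gp_predict(x_star, X, y, n_local=30, lengthscale=0.1, nugget=1e-6):
    """GP prediction at x_star using only its n_local nearest neighbors."""
    d = np.linalg.norm(X - x_star, axis=1)
    idx = np.argsort(d)[:n_local]                     # the local design
    Xl, yl = X[idx], y[idx]
    k = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                            / lengthscale ** 2)
    K = k(Xl, Xl) + nugget * np.eye(n_local)
    ks = k(x_star[None, :], Xl)[0]
    w = np.linalg.solve(K, ks)
    return w @ yl, 1.0 + nugget - ks @ np.linalg.solve(K, ks)

# big-ish toy data set where a full GP would already be getting expensive
rng = np.random.default_rng(9)
X = rng.uniform(size=(5000, 2))
y = np.sin(5 * X[:, 0]) * np.cos(5 * X[:, 1]) + rng.normal(0, 0.01, 5000)
mean, var = local_gp_predict(np.array([0.5, 0.5]), X, y)
print(round(mean, 3), "vs truth", round(np.sin(2.5) * np.cos(2.5), 3))
```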
- Stochastic Computer Model Calibration and Uncertainty Quantification. Fadikar, Arindam (Virginia Tech, 2019-07-24). This dissertation presents novel methodologies in the field of stochastic computer model calibration and uncertainty quantification. Simulation models are widely used in studying physical systems, which are often represented by a set of mathematical equations. Inference on the true physical system (unobserved or partially observed) is drawn based on observations from the corresponding computer simulation model. These computer models are calibrated based on limited ground-truth observations in order to produce realistic predictions and associated uncertainties. A stochastic computer model differs from a traditional computer model in that repeated execution results in different outcomes from the simulation. This additional uncertainty in the simulation model must be handled accordingly in any calibration setup. A Gaussian process (GP) emulator replaces the actual computer simulation when it is expensive to run and the budget is limited. However, a traditional GP interpolator models the mean and/or variance of the simulation output as a function of the input. For a simulation where the marginal Gaussianity assumption is not appropriate, it does not suffice to emulate only the mean and/or variance. We present two different approaches addressing the non-Gaussian behavior of an emulator, by (1) incorporating quantile regression in GPs for multivariate output, and (2) approximating with a finite mixture of Gaussians. These emulators are also used to calibrate and make forward predictions in the context of an agent-based disease model of the 2014 Ebola epidemic outbreak in West Africa. The third approach employs a sequential scheme that periodically updates the uncertainty in the computer model input as data become available in an online fashion. Unlike the other two methods, which use an emulator in place of the actual simulation, the sequential approach relies on repeated runs of the actual, potentially expensive simulation.
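To illustrate the first approach in spirit, the sketch below summarizes replicated stochastic-simulator output by empirical quantiles at each design input and smooths each quantile curve across the input space with a simple kernel smoother, standing in for the GP quantile emulator developed in the dissertation; the toy simulator, quantile levels, and bandwidth are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)

# stochastic simulator: skewed (non-Gaussian) output whose shape changes with x
def noisy_sim(x, n_rep=50):
    return np.sin(2 * np.pi * x) + rng.gamma(shape=1 + 3 * x, scale=0.2, size=n_rep)

x_design = np.linspace(0, 1, 25)
quantile_levels = [0.05, 0.5, 0.95]
# empirical quantiles of the replicated output at each design input
Q = np.array([np.quantile(noisy_sim(x), quantile_levels) for x in x_design])

def smooth_quantile(x_new, x_design, q_values, bandwidth=0.08):
    """Nadaraya-Watson smoother of one quantile curve across the input space
    (a stand-in for a GP quantile emulator)."""
    w = np.exp(-0.5 * ((x_new[:, None] - x_design[None, :]) / bandwidth) ** 2)
    return (w * q_values[None, :]).sum(axis=1) / w.sum(axis=1)

x_new = np.linspace(0, 1, 200)
emulated = {q: smooth_quantile(x_new, x_design, Q[:, j])
            for j, q in enumerate(quantile_levels)}
print({q: np.round(v[:3], 2) for q, v in emulated.items()})
```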
- Towards an in silico Experimental Platform for Air Quality: Houston, TX as a Case Study. Pires, Bianica; Korkmaz, Gizem; Ensor, Katherine; Higdon, David; Keller, Sallie A.; Lewis, Bryan L.; Schroeder, Aaron (CSSSA, 2015). In this paper we couple a spatiotemporal air quality model of ozone concentration levels with the synthetic information model of the Houston Metropolitan Area. While traditional approaches often aggregate the population, activities, or concentration levels of the pollutant across space and/or time, we utilize high-performance computing and statistical learning tools to maintain the granularity of the data, allowing us to attach specific exposure levels to the synthetic individuals based on the exact time of day and geolocation of the activity. We demonstrate that maintaining the granularity of the data is critical to more accurately reflect the heterogeneous exposure levels of the population across time within the greater Houston area. We find that individuals in the same zip code, neighborhood, block, and even household have varying levels of exposure depending on their activity patterns throughout the day.