Browsing by Author "Kim, Inyoung"
Now showing 1 - 20 of 95
- Advanced Machine Learning for Surrogate Modeling in Complex Engineering Systems
  Lee, Cheol Hei (Virginia Tech, 2023-08-02)
  Surrogate models are indispensable in the analysis of engineering systems. The quality of a surrogate model is determined by the data quality and the model class, but achieving a high standard of both is challenging in complex engineering systems. Heterogeneity, implicit constraints, and extreme events are typical examples of the factors that complicate systems, yet they have been underestimated or disregarded in machine learning. This dissertation is dedicated to tackling the challenges in surrogate modeling of complex engineering systems by developing the following machine learning methodologies. (i) Partitioned active learning partitions the design space according to heterogeneity in response features, thereby exploiting localized models to measure the informativeness of unlabeled data. (ii) For systems with implicit constraints, failure-averse active learning incorporates constraint outputs to estimate the safe region and avoid undesirable failures in learning the target function. (iii) Multi-output extreme spatial learning enables modeling and simulating extreme events in composite fuselage assembly. The proposed methods were applied to real-world case studies and outperformed benchmark methods.
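A minimal sketch of the localized-surrogate idea behind (i), under toy data and an assumed two-piece partition of the design space (illustration only, not the dissertation's algorithm): fit one Gaussian process per partition and query the candidate point with the largest local predictive uncertainty.

```python
# Localized Gaussian-process surrogates guide which unlabeled point to query next,
# one GP per (assumed known) partition of the design space.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
f = lambda x: np.where(x < 0.5, np.sin(20 * x), 0.2 * x)   # heterogeneous response

# Labeled data and a pool of unlabeled candidates on [0, 1]
X = rng.uniform(0, 1, 12).reshape(-1, 1)
y = f(X).ravel() + rng.normal(0, 0.05, len(X))
pool = np.linspace(0, 1, 200).reshape(-1, 1)

partitions = [(0.0, 0.5), (0.5, 1.0)]          # assumed partition of the design space
best_point, best_sd = None, -np.inf
for lo, hi in partitions:
    in_part = (X[:, 0] >= lo) & (X[:, 0] < hi)
    gp = GaussianProcessRegressor(kernel=RBF(0.1), alpha=1e-3).fit(X[in_part], y[in_part])
    cand = pool[(pool[:, 0] >= lo) & (pool[:, 0] < hi)]
    _, sd = gp.predict(cand, return_std=True)
    if sd.max() > best_sd:                      # most "informative" = largest local predictive sd
        best_sd, best_point = sd.max(), cand[sd.argmax(), 0]

print(f"next design point to label: x = {best_point:.3f} (sd = {best_sd:.3f})")
```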
- Advanced Nonparametric Bayesian Functional Modeling
  Gao, Wenyu (Virginia Tech, 2020-09-04)
  Functional analyses have gained more interest as we have easier access to massive data sets. However, such data sets often exhibit large heterogeneity, noise, and high dimensionality. When generalizing the analyses from vectors to functions, classical methods might not work directly. This dissertation considers noisy information reduction in functional analyses from two perspectives: functional variable selection to reduce the dimensionality, and functional clustering to group similar observations and thus reduce the sample size. The complicated data structures and relations can be easily modeled by a Bayesian hierarchical model, or developed from a more generic one by changing the prior distributions. Hence, this dissertation focuses on the development of Bayesian approaches for functional analyses due to their flexibility. A nonparametric Bayesian approach, such as the Dirichlet process mixture (DPM) model, has a nonparametric distribution as the prior. This approach provides flexibility and reduces assumptions, especially for functional clustering, because the DPM model has an automatic clustering property, so the number of clusters does not need to be specified in advance. Furthermore, a weighted Dirichlet process mixture (WDPM) model allows for more heterogeneity in the data by assuming more than one unknown prior distribution. It also gathers more information from the data by introducing a weight function that assigns different candidate priors, such that less similar observations are more separated. Thus, the WDPM model improves the clustering and model estimation results. In this dissertation, we used an advanced nonparametric Bayesian approach to study functional variable selection and functional clustering methods. We proposed 1) a stochastic search functional selection method, with application to 1-M matched case-crossover studies of aseptic meningitis, to examine the time-varying unknown relationship and identify important covariates affecting disease contraction; 2) a functional clustering method via the WDPM model, with application to three pathways related to genetic diabetes data, to identify essential genes distinguishing between normal and disease groups; and 3) a combined functional clustering (with the WDPM model) and variable selection approach, with application to high-frequency spectral data, to select wavelengths associated with breast cancer racial disparities.
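A truncated Dirichlet-process-style clustering can be sketched with scikit-learn's variational BayesianGaussianMixture; this illustrates the automatic-clustering property mentioned above, not the WDPM model itself.

```python
# Dirichlet-process-style clustering (truncated variational approximation): the model
# is only told an upper bound on the number of clusters and prunes the rest itself.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# Three latent groups in two dimensions
X = np.vstack([rng.normal(m, 0.3, size=(60, 2)) for m in (-2.0, 0.0, 2.5)])

dpm = BayesianGaussianMixture(
    n_components=10,                                  # truncation level, not the true K
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    random_state=1,
).fit(X)

labels = dpm.predict(X)
used = np.unique(labels)
print("clusters actually used:", used.size)           # typically 3, chosen automatically
print("mixture weights:", np.round(dpm.weights_[used], 3))
```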
- Advancements in Degradation Modeling, Uncertainty Quantification and Spatial Variable Selection
  Xie, Yimeng (Virginia Tech, 2016-06-30)
  This dissertation focuses on three research projects: 1) construction of simultaneous prediction intervals/bounds for at least k out of m future observations; 2) a semi-parametric degradation model for accelerated destructive degradation test (ADDT) data; and 3) spatial variable selection with application to Lyme disease data in Virginia. Following the general introduction in Chapter 1, the rest of the dissertation consists of three main chapters. Chapter 2 presents the construction of two-sided simultaneous prediction intervals (SPIs) or one-sided simultaneous prediction bounds (SPBs) to contain at least k out of m future observations, based on complete or right-censored data from the (log)-location-scale family of distributions. An SPI/SPB calculated by the proposed procedure has exact coverage probability for complete and Type II censored data. In the Type I censoring case, it has asymptotically correct coverage probability and reasonably good results for small samples. The proposed procedures can be extended to multiply-censored data or randomly censored data. Chapter 3 focuses on the analysis of ADDT data. We use a general degradation path model with correlated covariance structure to describe ADDT data. Monotone B-splines are used to model the underlying degradation process. A likelihood-based iterative procedure for parameter estimation is developed. The confidence intervals of parameters are calculated using the nonparametric bootstrap procedure. Both simulated data and real datasets are used to compare the semi-parametric model with the existing parametric models. Chapter 4 studies the Lyme disease emergence in Virginia. The objective is to find important environmental and demographical covariates that are associated with Lyme disease emergence. To address the high-dimensional integral problem in the loglikelihood function, we consider the penalized quasi-loglikelihood and the approximated loglikelihood based on the Laplace approximation. We impose the adaptive elastic net penalty to obtain sparse estimation of parameters and thus achieve variable selection of important variables. The proposed methods are investigated in simulation studies. We also apply the proposed methods to Lyme disease data in Virginia. Finally, Chapter 5 contains general conclusions and discussions for future work.
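The k-out-of-m simultaneous prediction bound in Chapter 2 can be illustrated with a small Monte Carlo check for normal data; the factor c below is an arbitrary assumption, whereas the actual procedure calibrates it to attain the nominal coverage level.

```python
# Monte Carlo sketch (normal data, not the paper's exact procedure): estimate the
# probability that the interval xbar +/- c*s contains at least k of m future observations.
import numpy as np

rng = np.random.default_rng(2)
n, m, k, c = 20, 5, 4, 2.6          # sample size, future obs, required coverage count, factor

def covers_at_least_k(_):
    sample = rng.normal(0, 1, n)
    xbar, s = sample.mean(), sample.std(ddof=1)
    future = rng.normal(0, 1, m)
    inside = np.abs(future - xbar) <= c * s
    return inside.sum() >= k

hits = np.mean([covers_at_least_k(i) for i in range(20000)])
print(f"estimated coverage for k={k} of m={m}: {hits:.3f}")
# In practice c would be calibrated (e.g., by search) so this coverage equals the nominal level.
```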
- Advancements on the Interface of Computer Experiments and Survival Analysis
  Wang, Yueyao (Virginia Tech, 2022-07-20)
  Design and analysis of computer experiments is an area focusing on efficient data collection (e.g., space-filling designs), surrogate modeling (e.g., Gaussian process models), and uncertainty quantification. Survival analysis focuses on modeling the period of time until a certain event happens. Data collection, prediction, and uncertainty quantification are also fundamental in survival models. In this dissertation, the proposed methods are motivated by a wide range of real-world applications, including high-performance computing (HPC) variability data, jet engine reliability data, Titan GPU lifetime data, and pine tree survival data. This dissertation explores the interface of computer experiments and survival analysis through these applications. Chapter 1 provides a general introduction to computer experiments and survival analysis. Chapter 2 focuses on the HPC variability management application. We investigate the applicability of space-filling designs and statistical surrogates in the HPC variability management setting, in terms of design efficiency, prediction accuracy, and scalability. A comprehensive comparison of the design strategies and predictive methods is conducted to study the prediction accuracy of their combinations. Chapter 3 focuses on the reliability prediction application. With the availability of multi-channel sensor data, a single degradation index is needed for compatibility with most existing models. We propose a flexible framework with multi-sensory data to model the nonlinear relationship between sensors and the degradation process. We also incorporate automatic variable selection to exclude sensors that have no effect on the underlying degradation process. Chapter 4 investigates inference approaches for spatial survival analysis under the Bayesian framework. The performance of Markov chain Monte Carlo (MCMC) approaches and variational inference is studied for two survival models: the cumulative exposure model and the proportional hazards (PH) model. The Titan GPU data and pine tree survival data are used to illustrate the capability of variational inference on spatial survival models. Chapter 5 provides some general conclusions.
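A minimal sketch of the computer-experiment workflow referenced above (space-filling design plus Gaussian process surrogate), with an assumed toy simulator standing in for an expensive computer model.

```python
# Latin hypercube design of 30 runs, then a GP surrogate with predictive uncertainty.
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def simulator(x):                       # stand-in for an expensive computer model
    return np.sin(3 * x[:, 0]) + 0.5 * x[:, 1] ** 2

sampler = qmc.LatinHypercube(d=2, seed=3)
X_unit = sampler.random(n=30)                        # 30 runs in [0, 1]^2
X = qmc.scale(X_unit, l_bounds=[0, -1], u_bounds=[2, 1])
y = simulator(X)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
X_new = np.array([[1.0, 0.3], [0.2, -0.8]])
mean, sd = gp.predict(X_new, return_std=True)        # prediction with uncertainty quantification
print(np.round(mean, 3), np.round(sd, 3))
```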
- Advances in Iterative Probabilistic Processing for Communication Receivers
  Jakubisin, Daniel Joseph (Virginia Tech, 2016-06-27)
  As wireless communication systems continue to push the limits of energy and spectral efficiency, increased demands are placed on the capabilities of the receiver. At the same time, the computational resources available for processing received signals will continue to grow. This opens the door for iterative algorithms to play an increasing role in the next generation of communication receivers. In the context of receivers, the goal of iterative probabilistic processing is to approximate maximum a posteriori (MAP) symbol-by-symbol detection of the information bits and estimation of the unknown channel or signal parameters. The sum-product algorithm is capable of efficiently approximating the marginal posterior probabilities desired for MAP detection and provides a unifying framework for the development of iterative receiver algorithms. However, in some applications the sum-product algorithm is computationally infeasible. Specifically, this is the case when both continuous and discrete parameters are present within the model. Also, the complexity of the sum-product algorithm is exponential in the number of variables connected to a particular factor node and can be prohibitive in multi-user and multi-antenna applications. In this dissertation we identify three key problems which can benefit from iterative probabilistic processing, but for which the sum-product algorithm is too complex. They are (1) joint synchronization and detection in multipath channels with emphasis on frame timing, (2) detection in co-channel interference and non-Gaussian noise, and (3) joint channel estimation and multi-signal detection. This dissertation presents the advances we have made in iterative probabilistic processing in order to tackle these problems. The motivation behind the work is to (a) compromise as little as possible on the performance that is achieved while limiting the computational complexity and (b) maintain good theoretical justification for the algorithms that are developed.
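To make the sum-product framework concrete, the toy example below runs belief propagation on a three-variable binary chain and checks the marginal against brute-force summation; it is an illustration of the algorithm itself, not a receiver design.

```python
# Toy sum-product (belief propagation) on a binary chain x1 - x2 - x3.
import numpy as np
import itertools

rng = np.random.default_rng(4)
phi1, phi2, phi3 = rng.uniform(0.5, 2.0, (3, 2))      # unary factors phi_i(x_i)
psi12, psi23 = rng.uniform(0.5, 2.0, (2, 2, 2))       # pairwise factors psi(x_i, x_j)

# Forward/backward messages into x2
m1_to_2 = psi12.T @ phi1                # sum_{x1} phi1(x1) psi12(x1, x2)
m3_to_2 = psi23 @ phi3                  # sum_{x3} psi23(x2, x3) phi3(x3)
belief2 = phi2 * m1_to_2 * m3_to_2
belief2 /= belief2.sum()

# Brute force over all 8 configurations
joint = np.zeros(2)
for x1, x2, x3 in itertools.product(range(2), repeat=3):
    joint[x2] += phi1[x1] * phi2[x2] * phi3[x3] * psi12[x1, x2] * psi23[x2, x3]
joint /= joint.sum()

print("sum-product marginal of x2:", np.round(belief2, 4))
print("brute-force marginal of x2:", np.round(joint, 4))
```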
- Advances in the Side-Channel Analysis of Symmetric Cryptography
  Taha, Mostafa Mohamed Ibrahim (Virginia Tech, 2014-06-10)
  Side-Channel Analysis (SCA) is an implementation attack where an adversary exploits unintentional outputs of a cryptographic module to reveal secret information. Unintentional outputs, also called side-channel outputs, include power consumption, electromagnetic radiation, execution time, photonic emissions, acoustic waves and many more. The real threat of SCA lies in the ability to mount attacks over small parts of the key and to aggregate information over many different traces. The cryptographic community acknowledges that SCA can break any security module if adequate protection is not implemented. In this dissertation, we propose several advances in side-channel attacks and countermeasures. We focus on symmetric cryptographic primitives, namely block ciphers and hashing functions. In the first part, we focus on improving side-channel attacks. First, we propose a new method to profile highly parallel cryptographic modules. Profiling, in the context of SCA, characterizes the power consumption of a fully-controlled module to extract power signatures. Then, the power signatures are used to attack a similar module. Parallel designs show excessive algorithmic noise in the power trace. Hence, we propose a novel attack that takes design parallelism into consideration, which results in a more powerful attack. Also, we propose the first comprehensive SCA of the new secure hashing function SHA-3. Although the main application of SHA-3 is hashing, there are other keyed applications, including Message Authentication Codes (MACs), where protection against SCA is required. We study the SCA properties of all the operations involved in SHA-3. We also study the effect of changing the key length on the difficulty of mounting attacks. Indeed, changing the key length changes the attack methodology. Hence, we propose complete attacks against five different case studies, and propose a systematic algorithm to choose an attack methodology based on the key length. In the second part, we propose different techniques for protection against SCA. Indeed, the threat of SCA can be mitigated if the secret key changes before every execution. Although many contributions in the domain of leakage-resilient cryptography have tried to achieve this goal, the proposed solutions were inefficient and required very high implementation cost. Hence, we highlight a generic framework for efficient leakage resiliency through lightweight key-updating. Then, we propose two complete solutions for protecting AES modes of operation. One uses a dedicated circuit for key-updating, while the other uses the underlying AES block cipher itself. The first one requires small area (for the additional circuit) but achieves negligible performance overhead. The second one has no area overhead but incurs a small performance overhead. Also, we address the problem of executing all the applications of hashing functions, e.g. the unkeyed application of regular hashing and the keyed application of generating MACs, on the same core. We observe that running an unkeyed application on an SCA-protected core involves a huge loss of performance (3x to 4x). Hence, we propose a novel SCA-protected core for hashing. Our core has no overhead in unkeyed applications, and negligible overhead in keyed ones.
Our research provides a better understanding of side-channel analysis and supports the cryptographic community with lightweight and efficient countermeasures.
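As a concrete illustration of attacking "small parts of the key" and aggregating over many traces, the sketch below runs a toy correlation power analysis on simulated Hamming-weight leakage; the S-box is a random permutation chosen here for brevity (not the AES S-box), and the leakage model is an assumption.

```python
# Toy correlation power analysis (CPA) recovering one key byte from simulated traces.
import numpy as np

rng = np.random.default_rng(5)
SBOX = rng.permutation(256)                       # toy substitution table (not the AES S-box)
true_key = 0x3C
hw = np.array([bin(v).count("1") for v in range(256)])

# Simulated leakage: Hamming weight of the S-box output plus Gaussian noise
plaintexts = rng.integers(0, 256, size=2000)
traces = hw[SBOX[plaintexts ^ true_key]] + rng.normal(0, 1.0, plaintexts.size)

# For every key guess, correlate the hypothetical leakage with the measured traces
scores = np.empty(256)
for guess in range(256):
    hypo = hw[SBOX[plaintexts ^ guess]]
    scores[guess] = np.corrcoef(hypo, traces)[0, 1]

print(f"recovered key byte: 0x{int(scores.argmax()):02X} (true: 0x{true_key:02X})")
```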
- Applications of Different Weighting Schemes to Improve Pathway-Based Analysis
  Ha, Sook S.; Kim, Inyoung; Wang, Yue; Xuan, Jianhua (Hindawi, 2011-05-22)
  Conventionally, pathway-based analysis assumes that genes in a pathway contribute equally to a biological function, thus assigning uniform weight to genes. However, this assumption has been proved incorrect, and applying uniform weight in pathway analysis may not be an appropriate approach for tasks like molecular classification of diseases, as genes in a functional group may have different predictive power. Hence, we propose assigning different weights to genes in pathway-based analysis and devise four weighting schemes. We applied them in two existing pathway analysis methods using both real and simulated gene expression data for pathways. Among all schemes, the random weighting scheme, which generates random weights and selects optimal weights minimizing an objective function, performs best in terms of P value or error rate reduction. Weighting changes pathway scoring and brings up some new significant pathways, leading to the detection of disease-related genes that are missed under uniform weight.
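A hedged sketch of the random-weighting idea for a single simulated pathway: draw candidate weight vectors, score each sample by the weighted expression sum, and keep the weights minimizing cross-validated classification error (the objective function here is an assumption for illustration, not the paper's).

```python
# Random weighting for one pathway: pick weights that minimize CV classification error.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n, p = 80, 10                                   # samples, genes in the pathway
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, n) > 0).astype(int)   # only 2 genes informative

best_w, best_err = np.full(p, 1 / p), 1.0       # start from uniform weights
for _ in range(200):
    w = rng.dirichlet(np.ones(p))               # random weights summing to one
    score = (X * w).sum(axis=1).reshape(-1, 1)  # weighted pathway score per sample
    err = 1 - cross_val_score(LogisticRegression(), score, y, cv=5).mean()
    if err < best_err:
        best_w, best_err = w, err

uniform_score = X.mean(axis=1).reshape(-1, 1)
uniform_err = 1 - cross_val_score(LogisticRegression(), uniform_score, y, cv=5).mean()
print("error with uniform weights:", round(uniform_err, 3))
print("error with selected weights:", round(best_err, 3), "top genes:", np.argsort(best_w)[-2:])
```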
- Bayesian Approach Dealing with Mixture Model Problems
  Zhang, Huaiye (Virginia Tech, 2012-04-23)
  In this dissertation, we focus on two research topics related to mixture models. The first topic is Adaptive Rejection Metropolis Simulated Annealing for Detecting Global Maximum Regions, and the second topic is Bayesian Model Selection for Nonlinear Mixed Effects Models. In the first topic, we consider a finite mixture model, which is used to fit data from heterogeneous populations in many applications. The Expectation Maximization (EM) algorithm and Markov Chain Monte Carlo (MCMC) are two popular methods to estimate parameters in a finite mixture model. However, both methods may converge to local maximum regions rather than the global maximum when multiple local maxima exist. In this dissertation, we propose a new approach, Adaptive Rejection Metropolis Simulated Annealing (ARMS annealing), to improve the EM algorithm and MCMC methods. Combining simulated annealing (SA) and adaptive rejection metropolis sampling (ARMS), ARMS annealing generates a set of proper starting points which help to reach all possible modes. ARMS uses a piecewise linear envelope function as a proposal distribution. Under the SA framework, we start with a set of proposal distributions constructed by ARMS, and this method finds a set of proper starting points, which help to detect separate modes. We refer to this approach as ARMS annealing. By combining ARMS annealing with the EM algorithm and with the Bayesian approach, respectively, we propose two approaches: an EM ARMS annealing algorithm and a Bayesian ARMS annealing approach. EM ARMS annealing implements the EM algorithm using a set of starting points proposed by ARMS annealing. ARMS annealing also helps MCMC approaches determine starting points. Both approaches capture the global maximum region and estimate the parameters accurately. An illustrative example uses survey data on the number of charitable donations. The second topic is related to the nonlinear mixed effects model (NLME). Typically a parametric NLME model requires strong assumptions which make the model less flexible and often are not satisfied in real applications. To allow the NLME model to have more flexible assumptions, we present three semiparametric Bayesian NLME models, constructed with Dirichlet process (DP) priors. Dirichlet process models are often referred to as infinite mixture models. We propose a unified approach, the penalized posterior Bayes factor, for the purpose of model comparison. Using simulation studies, we compare the performance of two of the three semiparametric hierarchical Bayesian approaches with that of the parametric Bayesian approach. Simulation results suggest that our penalized posterior Bayes factor is a robust method for comparing hierarchical parametric and semiparametric models. An application to gastric emptying studies is used to demonstrate the advantage of our estimation and evaluation approaches.
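The sensitivity of EM to starting points, which motivates ARMS annealing, can be seen with a plain multi-start comparison in scikit-learn; this is not the ARMS annealing procedure, only the problem it addresses.

```python
# Why starting points matter for finite-mixture estimation: one EM run vs. best of 25 starts.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(3, 0.5, 50)]).reshape(-1, 1)

single = GaussianMixture(n_components=2, n_init=1, random_state=0).fit(x)
multi = GaussianMixture(n_components=2, n_init=25, random_state=0).fit(x)   # keeps the best run

print("log-likelihood, single start:", round(single.score(x) * len(x), 1))
print("log-likelihood, 25 starts:   ", round(multi.score(x) * len(x), 1))
print("estimated means (multi-start):", np.round(multi.means_.ravel(), 2))
```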
- Bayesian Factor Models for Clustering and Spatiotemporal Analysis
  Shin, Hwasoo (Virginia Tech, 2024-05-28)
  Multivariate data is prevalent in modern applications, yet it often presents significant analytical challenges. Factor models can offer an effective tool to address issues associated with large-scale datasets. In this dissertation, we propose two novel Bayesian factor models. These models are designed to effectively reduce the dimensionality of the data, as the number of latent factors is typically much smaller than the dimension of the observation vectors; therefore, our proposed models can achieve substantial dimension reduction. Our first model is for spatiotemporal areal data. In this case, the region of interest is divided into subregions, and at each time point there is one univariate observation per subregion. Our model writes the vector of observations at each time point in factor model form, as the product of a matrix of factor loadings and a vector of common factors plus a vector of errors. Our model assumes that the common factors evolve through time according to a dynamic linear model. To represent the spatial relationships among subregions, each column of the factor loadings matrix is assigned an intrinsic conditional autoregressive (ICAR) prior. Therefore, we call our approach the Dynamic ICAR Spatiotemporal Factor Model (DIFM). Our second model, the Bayesian Clustering Factor Model (BCFM), assumes latent factors and clusters are present in the data. We apply Gaussian mixture models to the common factors to discover clusters. For both models, we develop MCMC to explore the posterior distribution of the parameters. To select the number of factors and, in the case of the clustering method, the number of clusters, we develop model selection criteria that utilize the Laplace-Metropolis estimator of the predictive density and BIC with integrated likelihood.
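A small simulation makes the dimension-reduction premise concrete: a few latent factors drive many observed series, so a low-dimensional summary recovers most of the variation. PCA is used here only as a quick non-Bayesian check; the proposed models instead place priors (e.g., ICAR) on the loadings and explore the posterior by MCMC.

```python
# Simulate y_t = Lambda f_t + noise with random-walk factors, then check compressibility.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
T, p, q = 300, 40, 2                            # time points, observed series, latent factors

factors = np.cumsum(rng.normal(0, 0.3, (T, q)), axis=0)      # random-walk common factors
loadings = rng.normal(0, 1, (p, q))
Y = factors @ loadings.T + rng.normal(0, 0.5, (T, p))         # observations

pca = PCA(n_components=q).fit(Y)
print("variance explained by 2 components:", round(pca.explained_variance_ratio_.sum(), 3))
```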
- Bayesian Hierarchical Latent Model for Gene Set Analysis
  Chao, Yi (Virginia Tech, 2009-04-29)
  A pathway is a predefined set of genes that serve a particular cellular or physiological function. Ranking pathways relevant to a particular phenotype can help researchers focus on a few sets of genes in pathways. In this thesis, a Bayesian hierarchical latent model was proposed using a generalized linear random effects model. The advantage of the approach is that it can easily incorporate prior knowledge when the sample size is small and the number of genes is large. For the covariance matrix of a set of random variables, two Gaussian random processes were considered to construct the dependencies among genes in a pathway: one based on the polynomial kernel and the other based on the Gaussian kernel. These two kernels were then compared with a constant covariance matrix for the random effect by using a ratio based on the joint posterior distribution with respect to each model. For mixture models, log-likelihood values were computed at different values of the mixture proportion and compared among mixtures of the selected kernels and a point-mass density (or constant covariance matrix). The approach was applied to a data set (Mootha et al., 2003) containing the expression profiles of type II diabetes, where the motivation was to identify pathways that can discriminate between normal patients and patients with type II diabetes.
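The two kernel choices for the gene-level covariance can be computed directly with scikit-learn's pairwise kernels; the data and hyperparameters below are placeholders for illustration.

```python
# Polynomial vs. Gaussian (RBF) kernel covariance matrices for a small expression matrix.
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

rng = np.random.default_rng(9)
Z = rng.normal(size=(15, 6))                 # 15 samples, 6 genes in one pathway

K_poly = polynomial_kernel(Z, degree=2)      # polynomial kernel covariance
K_rbf = rbf_kernel(Z, gamma=0.5)             # Gaussian (RBF) kernel covariance

print("polynomial kernel block:\n", np.round(K_poly[:3, :3], 2))
print("Gaussian kernel block:\n", np.round(K_rbf[:3, :3], 2))
```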
- Bayesian Inference Based on Nonparametric Regression for Highly Correlated and High Dimensional Data
  Yun, Young Ho (Virginia Tech, 2024-12-13)
  Establishing relationships among observed variables is important in many research studies. However, the task becomes increasingly difficult in the presence of unidentified complexities stemming from interdependencies among multi-dimensional variables and variability across subjects. This dissertation presents three novel methodological approaches to address these complex associations in highly correlated and high dimensional data. Firstly, group multi-kernel machine regression (GMM) is proposed to identify the association between two sets of multidimensional functions, offering the flexibility to effectively capture complex associations among high-dimensional variables. Secondly, semiparametric kernel machine regression under a Bayesian hierarchical structure is introduced for matched case-crossover studies, enabling flexible modeling of multiple covariate effects within strata and their complex interactions; this approach is denoted as fused kernel machine regression (Fused-KMR). Lastly, the dissertation presents a Bayesian hierarchical framework designed to identify multiple change points in the relationship between ambient temperature and mortality rate. This framework, unlike traditional methods, treats change points as random variables, enabling the modeling of nonparametric functions that vary by region; it is denoted as the multiple random change point (MRCP) model. Simulation studies and real-world applications illustrate the effectiveness and advantages of these approaches in capturing intricate associations and enhancing predictive accuracy.
- Bayesian Modeling of Complex High-Dimensional Data
  Huo, Shuning (Virginia Tech, 2020-12-07)
  With the rapid development of modern high-throughput technologies, scientists can now collect high-dimensional complex data in different forms, such as medical images and genomics measurements. However, acquisition of more data does not automatically lead to better knowledge discovery. One needs efficient and reliable analytical tools to extract useful information from complex datasets. The main objective of this dissertation is to develop innovative Bayesian methodologies to enable effective and efficient knowledge discovery from complex high-dimensional data. It contains two parts: the development of computationally efficient functional mixed models and the modeling of data heterogeneity via Dirichlet diffusion trees. The first part focuses on tackling the computational bottleneck in Bayesian functional mixed models. We propose a computational framework called the variational functional mixed model (VFMM). This new method facilitates efficient data compression and high-performance computing in basis space. We also propose a new multiple testing procedure in basis space, which can be used to detect significant local regions. The effectiveness of the proposed model is demonstrated through two datasets, a mass spectrometry dataset in a cancer study and a neuroimaging dataset in an Alzheimer's disease study. The second part is about modeling data heterogeneity by using Dirichlet diffusion trees. We propose a Bayesian latent tree model that incorporates covariates of subjects to characterize the heterogeneity and uncover the latent tree structure underlying the data. This innovative model may reveal the hierarchical evolution process through branch structures and estimate systematic differences between groups of samples. We demonstrate the effectiveness of the model through a simulation study and a brain tumor real data set.
- Bayesian Multilevel-multiclass Graphical Model
  Lin, Jiali (Virginia Tech, 2019-06-21)
  The Gaussian graphical model has been a popular tool for investigating conditional dependency between random variables by estimating sparse precision matrices. Two problems are discussed. One is to learn multiple Gaussian graphical models at multiple levels from unknown classes. The other is to select Gaussian processes in semiparametric multi-kernel machine regression. The first problem is approached via the Gaussian graphical model. In this project, I consider learning multiple connected graphs among multilevel variables from unknown classes. I estimate the classes of the observations from the mixture distributions by evaluating the Bayes factor and learn the network structures by fitting a novel neighborhood selection algorithm. This approach is able to identify the class membership and reveal network structures for multilevel variables simultaneously. Unlike most existing methods that solve this problem with frequentist approaches, I assess an alternative, a novel hierarchical Bayesian approach, to incorporate prior knowledge. The second problem focuses on the analysis of correlated high-dimensional data, which has been useful in many applications. In this work, I consider the problem of detecting signals with a semiparametric regression model which can study the effects of fixed covariates (e.g. clinical variables) and sets of elements (e.g. pathways of genes). I model the unknown high-dimensional functions of multi-sets via multi-Gaussian kernel machines to consider the possibility that elements within the same set interact with each other. Hence, my variable selection can be considered as Gaussian process selection. I develop my Gaussian process selection under the Bayesian variable selection framework.
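For the first problem, a frequentist stand-in for sparse precision-matrix estimation is the graphical lasso; the sketch below recovers the edges of a small chain graph and is only meant to make "estimating sparse precision matrices" concrete, not to reproduce the Bayesian neighborhood-selection algorithm.

```python
# Sparse precision-matrix (Gaussian graphical model) estimation via the graphical lasso.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(10)
# True precision matrix: a chain graph 0-1-2-3-4
prec = np.eye(5) + np.diag([0.4] * 4, k=1) + np.diag([0.4] * 4, k=-1)
cov = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(5), cov, size=500)

model = GraphicalLassoCV().fit(X)
edges = (np.abs(model.precision_) > 0.05) & ~np.eye(5, dtype=bool)
print("recovered edges:", sorted({tuple(sorted(e)) for e in zip(*np.where(edges))}))
```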
- Bayesian Optimization for Engineering Design and Quality Control of Manufacturing Systems
  AlBahar, Areej Ahmad (Virginia Tech, 2022-04-14)
  Manufacturing systems are usually nonlinear, nonstationary, highly corrupted with outliers, and oftentimes constrained by physical laws. Modeling and approximation of their underlying response surface functions are extremely challenging. Bayesian optimization is a great statistical tool, based on Bayes' rule, used to optimize and model these expensive-to-evaluate functions. Bayesian optimization comprises two important components, namely a surrogate model, often the Gaussian process, and an acquisition function, often the expected improvement. The Gaussian process, known for its outstanding modeling and uncertainty quantification capabilities, is used to represent the underlying response surface function, while the expected improvement is used to select the next point to be evaluated by trading off exploitation and exploration. Although Bayesian optimization has been extensively used in optimizing unknown and expensive-to-evaluate functions and in hyperparameter tuning of deep learning models, modeling highly outlier-corrupted, nonstationary, and stress-induced response surface functions hinders the use of conventional Bayesian optimization models in manufacturing systems. To overcome these limitations, we propose a series of systematic methodologies to improve Bayesian optimization for engineering design and quality control of manufacturing systems. Specifically, the contributions of this dissertation can be summarized as follows.
  1. A novel asymmetric robust kernel function, called AEN-RBF, is proposed to model highly outlier-corrupted functions. Two new hyperparameters are introduced to improve the flexibility and robustness of the Gaussian process model.
  2. A nonstationary surrogate model that utilizes deep multi-layer Gaussian processes, called MGP-CBO, is developed to improve the modeling of complex anisotropic constrained nonstationary functions.
  3. A Stress-Aware Optimal Actuator Placement framework is designed to model and optimize stress-induced nonlinear constrained functions.
  Through extensive evaluations, the proposed methodologies have shown outstanding and significant improvements when compared to state-of-the-art models. Although these proposed methodologies have been applied to certain manufacturing systems, they can be easily adapted to other broad ranges of problems.
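The two components named above (Gaussian process surrogate and expected improvement) can be assembled into a minimal Bayesian optimization loop; the kernel and test function below are generic assumptions, not the AEN-RBF kernel or MGP-CBO surrogate developed in the dissertation.

```python
# Minimal Bayesian optimization loop: GP surrogate + expected improvement (minimization).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(11)
f = lambda x: np.sin(3 * x) + 0.3 * x ** 2            # expensive-to-evaluate stand-in

X = rng.uniform(-2, 2, 4).reshape(-1, 1)              # initial design
y = f(X).ravel()
grid = np.linspace(-2, 2, 400).reshape(-1, 1)

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sd, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)  # expected improvement
    x_next = grid[ei.argmax()]                         # trade off exploitation and exploration
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next)[0])

print("best point found:", round(float(X[y.argmin(), 0]), 3), "value:", round(float(y.min()), 3))
```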
- Bayesian variable selection for linear mixed models when p is much larger than n with applications in genome wide association studies
  Williams, Jacob Robert Michael (Virginia Tech, 2023-06-05)
  Genome-wide association studies (GWAS) seek to identify single nucleotide polymorphisms (SNPs) causing phenotypic responses in individuals. Commonly, GWAS analyses are done using single marker association testing (SMA), which investigates the effect of a single SNP at a time and selects a candidate set of SNPs using a strict multiple-testing correction penalty. As SNPs are not independent but instead strongly correlated, SMA methods lead to such high false discovery rates (FDR) that the results are difficult to use by wet-lab scientists. To address this, this dissertation proposes three different novel Bayesian methods: BICOSS, BGWAS, and IEB. From a Bayesian modeling point of view, SNP search can be seen as a variable selection problem in linear mixed models (LMMs) where p is much larger than n. To deal with the p >> n issue, our three proposed methods use novel Bayesian approaches based on two steps: a screening step and a model selection step. To control false discoveries, we link the screening and model selection steps through a common probability of a null SNP. To deal with model selection, we propose novel priors that extend nonlocal priors, the Zellner g-prior, the unit information prior, and the Zellner-Siow prior to LMMs. For each method, extensive simulation studies and case studies show that these methods improve the recall of true causal SNPs and, more importantly, drastically decrease FDR. Because our Bayesian methods provide more focused and precise results, they may speed up discovery of important SNPs and significantly contribute to scientific progress in the areas of biology, agricultural productivity, and human health.
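For contrast, the SMA baseline that the proposed methods improve upon amounts to one marginal regression per SNP followed by a Bonferroni threshold, sketched below on simulated genotypes.

```python
# Single-marker association (SMA) baseline: one regression per SNP plus Bonferroni correction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
n, p = 300, 2000                                      # individuals, SNPs (p >> n)
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # genotypes coded 0/1/2
beta = np.zeros(p); beta[[10, 500, 1500]] = 0.6       # three causal SNPs
y = G @ beta + rng.normal(0, 1, n)

pvals = np.array([stats.linregress(G[:, j], y).pvalue for j in range(p)])
hits = np.where(pvals < 0.05 / p)[0]                  # Bonferroni-corrected threshold
print("SNPs passing Bonferroni:", hits)
# Correlated SNPs make such marginal tests prone to false discoveries, which motivates
# the two-step Bayesian screening and model-selection approach described above.
```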
- Bio-interfaced Nanolaminate Surface-enhanced Raman Spectroscopy Substrates
  Nam, Wonil (Virginia Tech, 2022-03-30)
  Surface-enhanced Raman spectroscopy (SERS) is a powerful analytical technique that combines the molecular specificity of vibrational fingerprints offered by Raman spectroscopy with the single-molecule detection sensitivity of plasmonic hotspots in noble metal nanostructures. Label-free SERS has attracted tremendous interest in bioanalysis over the last two decades due to minimal sample preparation, non-invasive measurement without water background interference, and multiplexing capability from the rich chemical information of narrow Raman bands. Nevertheless, significant challenges must be addressed for SERS to become a widely accepted technique in bio-related communities. In this dissertation, limitations in different aspects (performance, reliability, and analysis) are articulated along with the state of the art, followed by how the presented works resolve them. For high SERS performance, SERS substrates consisting of vertically stacked multiple metal-insulator-metal layers, named nanolaminate, were designed to simultaneously achieve high sensitivity and excellent uniformity, two properties previously deemed mutually exclusive. Two unique factors of nanolaminate SERS substrates were exploited to improve the reliability of label-free in situ classification using living cancer cells: background refractive index (RI) insensitivity from 1.30 to 1.60, covering extracellular components, and 3D protruding nanostructures that can generate a tight nano-bio interface (e.g., hotspot-cell coupling). Discrete nanolamination by new nanofabrication additionally provides optical transparency, offering backside excitation and thereby label-free glucose sensing on a skin-phantom model. Towards reliable quantitative SERS analysis, an electronic Raman scattering (ERS) calibration method was developed. ERS from metal is omnipresent in plasmonic constructs and experiences identical hotspot enhancements. Rigorous experimental results support that ERS can serve as an internal standard for spatial and temporal calibration of SERS signals, with significant potential for complex samples by overcoming intrinsic limitations of state-of-the-art Raman tags. ERS calibration was successfully applied to label-free living-cell SERS datasets for classifying cancer subtypes and cellular drug responses. Furthermore, dual-recognition label-SERS with a digital assay revealed improved accuracy in quantitative dopamine analysis. An artificial neural network-based machine learning method was exploited to improve the interpretability of bioanalytical SERS for multiple living-cell responses. Finally, this dissertation provides future perspectives on different aspects of designing bio-interfaced SERS devices for clinical translation, followed by guidance for SERS to become a standard analytical method that can compete with or complement existing technologies.
- Cluster-Based Profile Monitoring in Phase I Analysis
  Chen, Yajuan (Virginia Tech, 2014-03-26)
  Profile monitoring is a well-known approach used in statistical process control where the quality of the product or process is characterized by a profile, or a relationship between a response variable and one or more explanatory variables. Profile monitoring is conducted over two phases, labeled as Phase I and Phase II. In Phase I profile monitoring, regression methods are used to model each profile and to detect the possible presence of out-of-control profiles in the historical data set (HDS). The out-of-control profiles can be detected by using the statistic. However, previous methods of calculating the statistic are based on using all the data in the HDS, including the data from the out-of-control process. Consequently, the effectiveness of this method can be distorted if the HDS contains data from the out-of-control process. This work provides a new profile monitoring methodology for Phase I analysis. The proposed method, referred to as the cluster-based profile monitoring method, incorporates a cluster analysis phase before calculating the statistic. Before introducing our proposed cluster-based method in profile monitoring, this cluster-based method is demonstrated to work efficiently in robust regression, referred to as cluster-based bounded influence regression or CBI. It will be demonstrated that the CBI method provides a robust, efficient and high-breakdown regression parameter estimator. The CBI method first represents the data space via a special set of points, referred to as anchor points. Then a collection of single-point-added ordinary least squares regression estimators forms the basis of a metric used in defining the similarity between any two observations. Cluster analysis then yields a main cluster containing at least half the observations, with the remaining observations comprising one or more minor clusters. An initial regression estimator arises from the main cluster, with a group-additive DFFITS argument used to carefully activate the minor clusters through a bounded influence regression framework. CBI achieves a 50% breakdown point, is regression equivariant, scale and affine equivariant, and is asymptotically normal in distribution. Case studies and Monte Carlo results demonstrate the performance advantage of CBI over other popular robust regression procedures regarding coefficient stability, scale estimation and standard errors. The cluster-based method in Phase I profile monitoring first replaces the data from each sampled unit with an estimated profile, using some appropriate regression method. The estimated parameters for the parametric profiles are obtained from parametric models, while the estimated parameters for the nonparametric profiles are obtained from the p-spline model. The cluster phase clusters the profiles based on their estimated parameters, and this yields an initial main cluster which contains at least half the profiles. The initial estimated parameters for the population average (PA) profile are obtained by fitting a mixed model (parametric or nonparametric) to those profiles in the main cluster. Profiles that are not contained in the initial main cluster are iteratively added to the main cluster provided their statistics are "small", and the mixed model (parametric or nonparametric) is used to update the estimated parameters for the PA profile.
  Those profiles contained in the final main cluster are considered as resulting from the in-control process, while those not included are considered as resulting from an out-of-control process. This cluster-based method has been applied to monitor both parametric and nonparametric profiles. A simulated example, a Monte Carlo study, and an application to a real data set demonstrate the details of the algorithm, and the performance advantage of this proposed method over a non-cluster-based method is demonstrated with respect to more accurate estimates of the PA parameters and improved classification performance criteria. When the profiles can be represented by vectors, the profile monitoring process is equivalent to the detection of multivariate outliers. For this reason, we also compared our proposed method to a popular method used to identify outliers when dealing with a multivariate response. Our study demonstrated that when the out-of-control process corresponds to a sustained shift, the cluster-based method using the successive difference estimator is clearly the superior method, among those methods we considered, based on all performance criteria. In addition, the influence of accurate Phase I estimates on the performance of Phase II control charts is presented to show the further advantage of the proposed method. A simple example and Monte Carlo results show that more accurate estimates from Phase I would provide more efficient Phase II control charts.
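A simplified non-cluster-based Phase I baseline can be sketched by estimating each profile's regression parameters and flagging profiles with large T-squared values; the data, cutoff, and linear profile form are assumptions for illustration, not the dissertation's procedure.

```python
# Baseline Phase I profile monitoring: per-profile OLS estimates, then a T^2 screen.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(13)
x = np.linspace(0, 1, 20)
profiles = [2 + 1.5 * x + rng.normal(0, 0.2, x.size) for _ in range(30)]
profiles += [5 + 1.5 * x + rng.normal(0, 0.2, x.size) for _ in range(2)]   # shifted profiles

# Per-profile OLS estimates of (intercept, slope)
B = np.array([np.polynomial.polynomial.polyfit(x, y, 1) for y in profiles])

mean, cov = B.mean(axis=0), np.cov(B, rowvar=False)
d = B - mean
t2 = np.einsum("ij,jk,ik->i", d, np.linalg.inv(cov), d)
flagged = np.where(t2 > chi2.ppf(0.99, df=2))[0]
print("profiles flagged as potentially out of control:", flagged)
# With heavier contamination the shifted profiles inflate the pooled mean and covariance
# and can be masked, which is the weakness the cluster-based Phase I method addresses.
```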
- A Comparison of Discrete and Continuous Survival Analysis
  Kim, Sunha (Virginia Tech, 2014-05-08)
  There has been confusion in choosing between two popular survival models, discrete and continuous survival analysis. This study aimed to provide empirical outcomes of the two survival models in educational contexts and to suggest a guideline for researchers choosing a suitable survival model. For the model specification, the study paid attention to three factors: time metrics, censoring proportions, and sample sizes. To arrive at a comprehensive understanding of the three factors, the study investigated their separate and combined effects. Furthermore, to understand the interaction mechanism of those factors, this study examined their role in determining hazard rates, which are known to cause the discrepancies between discrete and continuous survival models. To provide empirical evidence from different combinations of the factors in the use of survival analysis, this study built a series of discrete and continuous survival models using secondary data and simulated data. In the first study, using empirical data from the National Longitudinal Survey of Youth 1997 (NLSY97), this study compared analysis results from the two models with different sizes of time metrics. In the second study, using various specifications that combine the two other factors of censoring proportions and sample sizes, this study simulated datasets to build the two models and compared the analysis results. The major finding of the study is that discrete models are recommended under conditions of large units of time metrics, low censoring proportion, or small sample sizes. In particular, the discrete model produced better outcomes for conditions with a low censoring proportion (20%) and a small number (i.e., four) of large time metrics (i.e., year), regardless of sample size. Close examination of those conditions of time metrics, censoring proportion, and sample sizes showed that the conditions resulted in high hazards (i.e., 0.20). In conclusion, to determine a proper model, it is recommended to examine the hazard of each of the time units along with the specific factors of time metrics, censoring proportion and sample sizes.
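A sketch of the discrete-time survival model in question: expand subjects into person-period records and model the hazard with a logistic regression whose period dummies give the discrete baseline hazard (the data-generating numbers below are arbitrary assumptions).

```python
# Discrete-time survival via person-period expansion and logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(14)
n, periods = 500, 4                                    # e.g., four yearly intervals
x = rng.binomial(1, 0.5, n)                            # a single covariate
base_hazard = np.array([0.10, 0.15, 0.20, 0.25])

rows, covs, events = [], [], []
for i in range(n):
    for t in range(periods):
        h = base_hazard[t] * (1.6 if x[i] else 1.0)    # covariate raises the hazard
        event = rng.random() < h
        rows.append(t); covs.append(x[i]); events.append(int(event))
        if event:
            break                                      # no further records after the event

# Design matrix: period dummies (discrete baseline hazard) plus the covariate
T = np.eye(periods)[rows]
X = np.column_stack([T, covs])
fit = LogisticRegression(C=1e6).fit(X, events)         # large C ~ (nearly) unpenalized fit
print("per-period baseline hazard (logit scale):", np.round(fit.coef_[0][:periods] + fit.intercept_, 2))
print("covariate effect (log-odds):", round(fit.coef_[0][-1], 2))
```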
- Computational Approaches to Predict Effect of Epigenetic Modifications on Transcriptional Regulation of Gene Expression
  Banerjee, Sharmi (Virginia Tech, 2019-10-07)
  This dissertation presents applications of machine learning and statistical approaches to infer protein-DNA bindings in the presence of epigenetic modifications. Epigenetic modifications are alterations that regulate gene expression while leaving the DNA sequence unaltered. They are heritable and reversible modifications and often involve the addition or removal of certain chemical groups on the DNA or its associated proteins. Histone modification is an epigenetic change that involves alteration of the histone proteins, thus changing the chromatin (DNA wound around histone proteins) structure; DNA methylation adds methyl groups to a cytosine base adjacent to a guanine base. Epigenetic factors often interfere in gene expression regulation by promoting or inhibiting protein-DNA bindings. Such proteins are known as transcription factors. Transcription is the first step of gene expression, where a particular segment of DNA is copied into messenger RNA (mRNA). Transcription factors orchestrate gene activity and are crucial for normal cell function in any organism. For example, deletion/mutation of certain transcription factors such as MEF2 has been associated with neurological disorders such as autism and schizophrenia. In this dissertation, different computational pipelines are described that use mathematical models to explain how protein-DNA bindings are mediated by histone modifications and DNA methylation affecting different regions of the brain at different stages of development. Multi-layer Markov models and inhomogeneous Poisson analyses are applied to data from the brain to show the impact of epigenetic factors on protein-DNA bindings. Such data-driven approaches reinforce the importance of epigenetic factors in governing brain cell differentiation into different neuron types, regulation of memory, and promotion of normal brain development at the early stages of life.
- Contributions to Data Reduction and Statistical Model of Data with Complex Structures
  Wei, Yanran (Virginia Tech, 2022-08-30)
  With advanced technology and the information explosion, data of interest often have complex structures, with large size and dimension in the form of continuous or discrete features. There is an emerging need for data reduction, efficient modeling, and model inference. For example, data can contain millions of observations with thousands of features. Traditional methods, such as linear regression or LASSO regression, cannot effectively deal with such a large dataset directly. This dissertation aims to develop several techniques to effectively analyze large datasets with complex structures in observational, experimental and time series data. In Chapter 2, I focus on data reduction for model estimation in sparse regression. Commonly used subdata selection methods often consider sampling or feature screening. For data with both a large number of observations and a large number of predictors, we propose a filtering approach for model estimation (FAME) to reduce both the number of data points and the number of features. The proposed algorithm can be easily extended to data with discrete responses or discrete predictors. Through simulations and case studies, the proposed method provides good performance for parameter estimation with efficient computation. In Chapter 3, I focus on modeling experimental data with a quantitative-sequence (QS) factor. Here the QS factor concerns both the quantities and the sequence orders of several components in the experiment. Existing methods usually focus only on the sequence orders or the quantities of the multiple components. To fill this gap, we propose a QS transformation to transform the QS factor to a generalized permutation matrix, and consequently develop a simple Gaussian process approach to model experimental data with QS factors. In Chapter 4, I focus on forecasting multivariate time series data by leveraging autoregression and clustering. Existing time series forecasting methods treat each series independently and ignore their inherent correlation. To fill this gap, I propose a clustering method based on autoregression and control the sparsity of the transition matrix estimation by the adaptive lasso and a clustering coefficient. The clustering-based cross prediction can outperform conventional time series forecasting methods. Moreover, the clustering result can also enhance the forecasting accuracy of other forecasting methods. The proposed method can be applied to practical data, such as stock forecasting and topic trend detection.
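An illustrative two-step data-reduction sketch in the spirit of Chapter 2, screening features and subsampling rows before a sparse fit; this is not the FAME algorithm, and the screening rule and sizes are assumptions.

```python
# Reduce both rows and columns before fitting: marginal-correlation screening + row
# subsampling, then a cross-validated lasso on the reduced data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(15)
n, p = 10000, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = [3, -2, 1.5, 1, -1]          # only 5 active features
y = X @ beta + rng.normal(0, 1, n)

# Step 1: feature screening by absolute marginal correlation with the response
corr = np.abs((X - X.mean(0)).T @ (y - y.mean())) / (n * X.std(0) * y.std())
keep = np.argsort(corr)[-30:]                               # keep top 30 candidate features

# Step 2: row subsampling, then sparse estimation on the reduced data
rows = rng.choice(n, size=2000, replace=False)
fit = LassoCV(cv=5).fit(X[np.ix_(rows, keep)], y[rows])
selected = keep[np.abs(fit.coef_) > 1e-3]
print("features selected from the reduced data:", np.sort(selected))
```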