Advanced Robust Statistical Learning Methods with Application in Healthcare and Manufacturing
Abstract
This dissertation presents the development and validation of advanced robust statistical methods for applications in healthcare and manufacturing. The work consists of three main parts, each addressing a distinct challenge and demonstrating the need for robust algorithms in statistical learning.
The first part is motivated by the need to understand the relationship between brain networks and phenotypes of interest in small-scale neuroimaging studies with limited sample sizes. I developed a flow-based generative model, termed Disentangled Adversarial Flow (DAF), which leverages large-scale multi-source datasets to improve prediction accuracy in such studies. A bidirectional generative architecture and a kernel-based dependence measure are used to generate domain-invariant brain connectomes. An ensemble-based DAF regression framework is then proposed to integrate information from multiple source datasets and improve prediction on the target dataset. By borrowing information from other data sources despite the heterogeneity among them, this framework enables reliable predictions with limited sample sizes, exemplifying robustness in statistical learning.
Similar challenges arise in manufacturing, where variations in product designs, process parameters, and sensor configurations generate diverse data distributions. This makes it difficult to develop machine learning pipelines that consistently achieve high performance under varying conditions. Motivated by this, the second part of the dissertation introduces a weighted ensemble mechanism, based on the Bayesian Latent Space Model recommender system, that optimizes sparse ensemble weights while incorporating uncertainty quantification. By automating the selection and adaptation of optimal machine learning pipelines, this method supports data-driven decision-making in industrial settings and demonstrates robustness by maintaining high performance as industrial data conditions change.
Distribution shifts are also common in medical records, where heterogeneity across individuals hinders automated diagnosis. A robust algorithm that generalizes across patients could lead to more accurate personalized care. Inspired by this, the third part proposes a latent factor model based on an Interleaved-window Transformer to characterize inter-subject heterogeneity, focusing on heterogeneous physiological time series data derived from sources such as Electronic Health Records, electrocardiograms, and electroencephalograms. Different factors in the latent factor model represent different characteristics of the time series. These latent factors are linked to the response through subject-specific weights, which capture the varying contribution of each factor to the response across subjects. Contrastive learning is used to estimate the weights for new subjects not seen during training. This part underlines the theme of robustness by developing a model that adapts to individual differences, so that the statistical learning methods remain effective across diverse patient data.
This dissertation demonstrates the value of robustness as a unifying theme in advancing statistical learning methodologies and their applications.