Statistical Machine Learning for Multi-platform Biomedical Data Analysis
Recent advances in biotechnology have enabled multi-platform, large-scale quantitative measurement of biomedical events. The need to analyze the vast amounts of imaging and genomic data produced has stimulated novel applications of statistical machine learning methods in many areas of biomedical research. The main objective is to help biomedical investigators better interpret, analyze, and understand biomedical questions based on the acquired data. The computational challenges posed by these high-dimensional, complex data create new opportunities and roles for machine learning research. In this dissertation, we develop, test, and apply novel statistical machine learning methods to analyze data acquired mainly by dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) and single nucleotide polymorphism (SNP) microarrays. The research focuses on (1) tissue-specific compartmental analysis for DCE-MRI of complex tumors and (2) computational analysis for detecting SNP interactions in genome-wide association studies.
DCE-MRI provides a noninvasive method for evaluating tumor vasculature patterns based on contrast accumulation and washout. Compartmental analysis is a widely used mathematical tool for modeling dynamic imaging data and can provide accurate pharmacokinetic parameter estimates. However, the partial volume effect (PVE) present in imaging data can have a profound effect on the accuracy of pharmacokinetic studies. We therefore propose a convex analysis of mixtures (CAM) algorithm that explicitly eliminates PVE by expressing the kinetics in each pixel as a nonnegative combination of the underlying compartments and then identifying pure-volume pixels at the corners of the clustered pixel time-series scatter plot. The algorithm is supported by a series of newly proved theorems and by additional noise-filtering and normalization preprocessing.
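The geometric idea behind locating pure-volume pixels can be illustrated with a small sketch. It is not the CAM algorithm itself: it swaps in a plain successive-projection corner search, omits clustering and noise filtering, and uses made-up compartment curves. The function names (`mix_pixels`, `normalize`, `find_corners`) and all numeric values are assumptions introduced only for this illustration.

```python
import random

def mix_pixels(sources, proportions):
    """Form each pixel time series as a nonnegative combination of source curves."""
    n_time = len(sources[0])
    return [[sum(a * s[t] for a, s in zip(props, sources)) for t in range(n_time)]
            for props in proportions]

def normalize(pixels):
    """Scale every time series to unit sum so that all pixels lie on one simplex."""
    out = []
    for p in pixels:
        total = sum(p)
        out.append([v / total for v in p])
    return out

def find_corners(points, k):
    """Successive projection: take the largest-norm residual, project it out, repeat."""
    residuals = [list(p) for p in points]
    corners = []
    for _ in range(k):
        norms = [sum(v * v for v in r) for r in residuals]
        idx = max(range(len(residuals)), key=norms.__getitem__)
        corners.append(idx)
        u = list(residuals[idx])  # copy, because residuals are updated in place
        for r in residuals:
            coef = sum(a * b for a, b in zip(r, u)) / norms[idx]
            for t in range(len(r)):
                r[t] -= coef * u[t]
    return corners

# Hypothetical compartment curves at six time points (illustrative values only).
sources = [
    [0.0, 0.9, 0.7, 0.4, 0.2, 0.1],    # fast washout
    [0.0, 0.2, 0.4, 0.6, 0.7, 0.8],    # slow accumulation
    [0.0, 1.0, 0.5, 0.3, 0.2, 0.15],   # plasma-like input
]

rng = random.Random(0)
proportions = []
for _ in range(60):
    w = [rng.uniform(0.1, 1.0) for _ in sources]  # partial-volume pixel: all compartments present
    proportions.append([x / sum(w) for x in w])

pure_idx = [5, 17, 29]  # pixels forced to contain a single compartment
for k, i in enumerate(pure_idx):
    proportions[i] = [1.0 if j == k else 0.0 for j in range(len(sources))]

pixels = normalize(mix_pixels(sources, proportions))
corners = find_corners(pixels, k=len(sources))
print(sorted(corners), pure_idx)
```

After normalization every mixed pixel is a convex combination of the normalized compartment curves, so the pure pixels sit at the vertices of the scatter plot, and a corner search in this noise-free setting should return exactly the indices in `pure_idx`.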
We demonstrate the principle and feasibility of the CAM approach, together with compartmental modeling, on realistic synthetic data, and compare the accuracy of the parameter estimates obtained using CAM with those obtained using other relevant techniques. Experimental results show a significant improvement in the accuracy of kinetic parameter estimation. We then apply the algorithm to real DCE-MRI data of breast cancer and observe improved pharmacokinetic parameter estimation that separates tumor tissue into sub-regions with differential tracer kinetics on a pixel-by-pixel basis and reveals biologically plausible patterns of tumor tissue heterogeneity. The method combines the advantages of multivariate clustering, convex optimization, and compartmental modeling.
Interactions among genetic loci are believed to play an important role in disease risk. Because of the very high dimensionality of SNP data (typically several million in genome-wide association studies), the combinatorial search and statistical evaluation required to detect multi-locus interactions constitute a challenging computational task. While many approaches have been proposed for detecting such interactions, their relative performance remains largely unclear because they were evaluated on different data sources, using different performance measures, and under different experimental protocols. Given the importance of detecting gene-gene interactions, a thorough evaluation of the performance and limitations of available methods, a theoretical analysis of the interaction effect and the genetic factors it depends on, and the development of more efficient methods are all warranted. We therefore perform a computational analysis for detecting interactions among SNPs.
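To give a sense of the combinatorial search described above, the short calculation below counts candidate SNP subsets for an exhaustive scan. The figure of one million SNPs is an illustrative assumption, at the low end of what the abstract cites.

```python
import math

n_snps = 1_000_000  # illustrative assumption; the text cites several million

pairs = math.comb(n_snps, 2)    # candidate two-locus combinations
triples = math.comb(n_snps, 3)  # candidate three-locus combinations

print(f"two-locus candidates:   {pairs:.2e}")
print(f"three-locus candidates: {triples:.2e}")
```

Even at this size an exhaustive scan covers roughly 5 × 10^11 pairs and 1.7 × 10^17 triples, which is why heuristic search and cheap test statistics matter for higher-order interactions.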
The contributions are four-fold. We (1) developed simulation tools for evaluating the performance of any technique designed to detect interactions among genetic variants in case-control studies; (2) used these tools to compare the performance of five popular SNP interaction detection methods; (3) derived analytic relationships between power and the genetic factors, which not only support the experimental results but also give a quantitative link between the interaction effect and these factors; and (4) based on the insights gained from the comparative and theoretical analyses, developed an efficient, statistically principled method, hybrid correlation-based association (HCA), to detect interacting SNPs. The HCA algorithm is based on three correlation-based statistics designed to measure the strength of multi-locus interaction under three different interaction types, which together cover a large portion of possible interactions. To maximize detection power (sensitivity) while suppressing the false positive rate (that is, retaining moderate specificity), we also devised a strategy to hybridize these three statistics on a case-by-case basis. A heuristic search strategy is also proposed to substantially reduce the computational complexity, especially for high-order interaction detection. We tested HCA in both simulation studies and real disease studies. HCA and selected peer methods were compared on a large number of simulated datasets, each including multiple sets of interaction models. The assessment criteria included several power measures, the family-wise type I error rate, and computational complexity. The experimental results on the simulated data indicate that HCA achieves a promising balance between detection accuracy and computational complexity. When run on multiple real datasets, HCA also replicates plausible biomarkers reported in the previous literature.
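As a rough illustration of the two ingredients named above, simulated case-control data and a correlation-based interaction score, the sketch below generates genotypes under a simple two-locus model and contrasts the genotype correlation in cases with that in controls using a Fisher z-transform. This is a generic textbook-style statistic, not one of the three HCA statistics, and the penetrance values, allele frequency, and function names are all assumptions made for the example.

```python
import math
import random

def sample_genotype(rng, maf):
    """Draw a 0/1/2 minor-allele count under Hardy-Weinberg equilibrium."""
    return (rng.random() < maf) + (rng.random() < maf)

def simulate(n, maf, base_risk, joint_risk, seed=0):
    """Simulate four independent SNPs; only SNPs 0 and 1 jointly raise disease risk."""
    rng = random.Random(seed)
    cases, controls = [], []
    for _ in range(n):
        g = [sample_genotype(rng, maf) for _ in range(4)]
        risk = joint_risk if g[0] > 0 and g[1] > 0 else base_risk
        (cases if rng.random() < risk else controls).append(g)
    return cases, controls

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_contrast(cases, controls, i, j):
    """Fisher z-test for a difference in SNP i / SNP j correlation, cases vs controls."""
    r_case = pearson([g[i] for g in cases], [g[j] for g in cases])
    r_ctrl = pearson([g[i] for g in controls], [g[j] for g in controls])
    se = math.sqrt(1 / (len(cases) - 3) + 1 / (len(controls) - 3))
    return (math.atanh(r_case) - math.atanh(r_ctrl)) / se

# Assumed settings: minor allele frequency 0.3, risk 0.1 at baseline, 0.6 for joint carriers.
cases, controls = simulate(n=4000, maf=0.3, base_risk=0.1, joint_risk=0.6)
z_interacting = correlation_contrast(cases, controls, 0, 1)  # truly interacting pair
z_null = correlation_contrast(cases, controls, 2, 3)         # pair unrelated to disease
print(f"interacting pair z = {z_interacting:.1f}, null pair z = {z_null:.1f}")
```

Under these assumed settings the interacting pair should produce a large positive score while the null pair stays near zero. A genome-wide scan would additionally need a multiple-testing correction to control the family-wise type I error rate mentioned above.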