Topic Model-based Mass Spectrometric Data Analysis in Cancer Biomarker Discovery Studies
MetadataShow full item record
Identification of disease-related alterations in molecular and cellular mechanisms may reveal useful biomarkers for human diseases including cancers. High-throughput omic technologies for identifying and quantifying multi-level biological molecules (e.g., proteins, glycans, and metabolites) have facilitated the advances in biological research in recent years. Liquid (or gas) chromatography coupled with mass spectrometry (LC/GC-MS) has become an essential tool in such large-scale omic studies. Appropriate LC/GC-MS data preprocessing pipelines are needed to detect true differences between biological groups. Challenges exist in several aspects of MS data analysis. Specifically for biomarker discovery, one fundamental challenge in quantitation of biomolecules is owing to the heterogeneous nature of human biospecimens. Although this issue has been a subject of discussion in cancer genomic studies, it has not yet been rigorously investigated in mass spectrometry based omic studies. Purification of mass spectometric data is highly desired prior to subsequent differential analysis. In this research dissertation, we majorly target at addressing the purification problem through probabilistic modeling. We propose an intensity-level purification model (IPM) to computationally purify LC/GC-MS based cancerous data in biomarker discovery studies. We further extend IPM to scan-level purification model (SPM) by considering information from extracted ion chromatogram (EIC, scan-level feature). Both IPM and SPM belong to the category of topic modeling approach, which aims to identify the underlying "topics" (sources) and their mixture proportions in composing the heterogeneous data. Additionally, denoise deconvolution model (DMM) is proposed to capture the noise signals in samples based on purified profiles. Variational expectation-maximization (VEM) and Markov chain Monte Carlo (MCMC) methods are used to draw inference on the latent variables and estimate the model parameters. Before we come to purification, other research topics in related to mass spectrometric data analysis for cancer biomarker discovery are also investigated in this dissertation. Chapter 3 discusses the developed methods in the differential analysis of LC/GC-MS based omic data, specifically for the preprocessing in data of LC-MS profiled glycans. Chapter 4 presents the assumptions and inference details of IPM, SPM, and DDM. A latent Dirichlet allocation (LDA) core is used to model the heterogeneous cancerous data as mixtures of topics consisting of sample-specific pure cancerous source and non-cancerous contaminants. We evaluated the capability of the proposed models in capturing mixture proportions of contaminants and cancer profiles on LC-MS based serum and tissue proteomic and GC-MS based tissue metabolomic datasets acquired from patients with hepatocellular carcinoma (HCC) and liver cirrhosis. Chapter 5 elaborates these applications in cancer biomarker discovery, where typical single omic and integrative analysis of multi-omic studies are included.
- Doctoral Dissertations