Mathematical Modeling and Deconvolution for Molecular Characterization of Tissue Heterogeneity
MetadataShow full item record
Tissue heterogeneity, arising from intermingled cellular or tissue subtypes, significantly obscures the analyses of molecular expression data derived from complex tissues. Existing computational methods performing data deconvolution from mixed subtype signals almost exclusively rely on supervising information, requiring subtype-specific markers, the number of subtypes, or subtype compositions in individual samples. We develop a fully unsupervised deconvolution method to dissect complex tissues into molecularly distinctive tissue or cell subtypes directly from mixture expression profiles. We implement an R package, deconvolution by Convex Analysis of Mixtures (debCAM) that can automatically detect tissue or cell-specific markers, determine the number of constituent sub-types, calculate subtype proportions in individual samples, and estimate tissue/cell-specific expression profiles. We demonstrate the performance and biomedical utility of debCAM on gene expression, methylation, and proteomics data. With enhanced data preprocessing and prior knowledge incorporation, debCAM software tool will allow biologists to perform a deep and unbiased characterization of tissue remodeling in many biomedical contexts. Purified expression profiles from physical experiments provide both ground truth and a priori information that can be used to validate unsupervised deconvolution results or improve supervision for various deconvolution methods. Detecting tissue or cell-specific expressed markers from purified expression profiles plays a critical role in molecularly characterizing and determining tissue or cell subtypes. Unfortunately, classic differential analysis assumes a convenient test statistic and associated null distribution that is inconsistent with the definition of markers and thus results in a high false positive rate or lower detection power. We describe a statistically-principled marker detection method, One Versus Everyone Subtype Exclusively-expressed Genes (OVESEG) test, that estimates a mixture null distribution model by applying novel permutation schemes. Validated with realistic synthetic data sets on both type 1 error and detection power, OVESEG-test applied to benchmark gene expression data sets detects many known and de novo subtype-specific expressed markers. Subsequent supervised deconvolution results, obtained using markers detected by the OVESEG-test, showed superior performance when compared with popular peer methods. While the current debCAM approach can dissect mixed signals from multiple samples into the 'averaged' expression profiles of subtypes, many subsequent molecular analyses of complex tissues require sample-specific deconvolution where each sample is a mixture of 'individualized' subtype expression profiles. The between-sample variation embedded in sample-specific subtype signals provides critical information for detecting subtype-specific molecular networks and uncovering hidden crosstalk. However, sample-specific deconvolution is an underdetermined and challenging problem because there are more variables than observations. We propose and develop debCAM2.0 to estimate sample-specific subtype signals by nuclear norm regularization, where the hyperparameter value is determined by random entry exclusion based cross-validation scheme. We also derive an efficient optimization approach based on ADMM to enable debCAM2.0 application in large-scale biological data analyses. Experimental results on realistic simulation data sets show that debCAM2.0 can successfully recover subtype-specific correlation networks that is unobtainable otherwise using existing deconvolution methods.
General Audience Abstract
Tissue samples are essentially mixtures of tissue or cellular subtypes where the proportions of individual subtypes vary across different tissue samples. Data deconvolution aims to dissect tissue heterogeneity into biologically important subtypes, their proportions, and their marker genes. The physical solution to mitigate tissue heterogeneity is to isolate pure tissue components prior to molecular profiling. However, these experimental methods are time-consuming, expensive and may alter the expression values during isolation. Existing literature primarily focuses on supervised deconvolution methods which require a priori information. This approach has an inherent problem as it relies on the quality and accuracy of the a priori information. In this dissertation, we propose and develop a fully unsupervised deconvolution method - deconvolution by Convex Analysis of Mixtures (debCAM) that can estimate the mixing proportions and 'averaged' expression profiles of individual subtypes present in heterogeneous tissue samples. Furthermore, we also propose and develop debCAM2.0 that can estimate 'individualized' expression profiles of participating subtypes in complex tissue samples. Subtype-specific expressed markers, or marker genes (MGs), serves as critical a priori information for supervised deconvolution. MGs are exclusively and consistently expressed in a particular tissue or cell subtype while detecting such unique MGs involving many subtypes constitutes a challenging task. We propose and develop a statistically-principled method - One Versus Everyone Subtype Exclusively-expressed Genes (OVESEG-test) for robust detection of MGs from purified profiles of many subtypes.
- Doctoral Dissertations