Unsupervised Signal Deconvolution for Multiscale Characterization of Tissue Heterogeneity
Files
TR Number
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Characterizing complex tissues requires precise identification of distinctive cell types, cell-specific signatures, and subpopulation proportions. Tissue heterogeneity, arising from multiple cell types, is a major confounding factor in studying individual subpopulations and repopulation dynamics. Tissue heterogeneity cannot be resolved directly by most global molecular and genomic profiling methods. While signal deconvolution has widespread applications in many real-world problems, there are significant limitations associated with existing methods, mainly unrealistic assumptions and heuristics, leading to inaccurate or incorrect results. In this study, we formulate the signal deconvolution task as a blind source separation problem, and develop novel unsupervised deconvolution methods within the Convex Analysis of Mixtures (CAM) framework, for characterizing multi-scale tissue heterogeneity. We also explanatorily test the application of Significant Intercellular Genomic Heterogeneity (SIGH) method.
Unlike existing deconvolution methods, CAM can identify tissue-specific markers directly from mixed signals, a critical task, without relying on any prior knowledge. Fundamental to the success of our approach is a geometric exploitation of tissue-specific markers and signal non-negativity. Using a well-grounded mathematical framework, we have proved new theorems showing that the scatter simplex of mixed signals is a rotated and compressed version of the scatter simplex of pure signals and that the resident markers at the vertices of the scatter simplex are the tissue-specific markers. The algorithm works by geometrically locating the vertices of the scatter simplex of measured signals and their resident markers. The minimum description length (MDL) criterion is applied to determine the number of tissue populations in the sample. Based on CAM principle, we integrated nonnegative independent component analysis (nICA) and convex matrix factorization (CMF) methods, developed CAM-nICA/CMF algorithm, and applied them to multiple gene expression, methylation and protein datasets, achieving very promising results validated by the ground truth or gene enrichment analysis. We integrated CAM with compartment modeling (CM) and developed multi-tissue compartment modeling (MTCM) algorithm, tested on real DCE-MRI data derived from mouse models with consistent and plausible results. We also developed an open-source R-Java software package that implements various CAM based algorithms, including an R package approved by Bioconductor specifically for tumor-stroma deconvolution.
While intercellular heterogeneity is often manifested by multiple clones with distinct sequences, systematic efforts to characterize intercellular genomic heterogeneity must effectively distinguish significant genuine clonal sequences from probabilistic fake derivatives. Based on the preliminary studies originally targeting immune T-cells, we tested and applied the SIGH algorithm to characterize intercellular heterogeneity directly from mixed sequencing reads. SIGH works by exploiting the statistical differences in both the sequencing error rates at different nucleobases and the read counts of fake sequences in relation to genuine clones of variable abundance.