Browsing by Author "Cheng, Zuolin"
Now showing 1 - 4 of 4
- Conditional Multifactorial Contingency (CMC) Model and Its Applications. Cheng, Zuolin (Virginia Tech, 2023-01-17). In biology and bioinformatics, a variety of data share a common property that challenges numerous cutting-edge research studies: heterogeneities at the individual level with respect to more than one factor. Examples of such heterogeneities include but are not limited to: 1) unequal susceptibility of different patients, and 2) large diversity in gene length, GC content, etc., along with the resulting gene characteristics. For many biological data analysis studies, the critical first step is usually to infer the null probability distribution of the observed data with the heterogeneities in multiple (confounding) factors taken into account, so that we can further investigate the impact of other factor(s) of interest. These heterogeneities heavily influence the conclusions that we may draw from statistical analyses of the data. However, modeling such heterogeneities has been challenging, not only because explicitly modeling all factors with heterogeneous effects on the data is impractical, but also because many factors are not independent of one another. Existing methods either partially or fully neglect the heterogeneity issue, or handle each factor's heterogeneity in isolation. Evidence has shown the insufficiency of such strategies and the errors they may produce in downstream analyses. The emergence of large-scale data sets provides the opportunity to learn the heterogeneity directly and comprehensively from the data, without explicitly modeling the underlying mechanisms or imposing strong assumptions. The data, often stored or organized as multidimensional contingency tensors, lead to a natural perspective of modeling heterogeneity with each impact factor of interest being one dimension.
The heterogeneity in each factor's impact on the variable of interest can be captured by the marginal property of the data tensor with respect to the corresponding dimension. For instance, in a single-cell sequencing dataset, which can be organized as a matrix with each row representing a gene and each column representing a cell, the heterogeneity caused by both the gene and cell factors can be modeled. In this dissertation, we develop a novel model, Conditional Multifactorial Contingency (CMC), that models the intertwined heterogeneities in all dimensions of the data tensor and infers the probability distribution of each entry of the data tensor jointly conditioned on these heterogeneities. In the proposed CMC model, the problem is formulated as a maximum entropy problem over the contingency tensor's probability distribution subject to the marginal constraints, under the assumption that the individuals within each dimension are independent. The marginal constraints are applied to the expected values instead of the observed trial outcomes, which plays a key role in avoiding the innumerable combinations of trial outcomes and leads to an elegant expression for each entry's probability distribution. The model is first developed for a 3D binary data matrix, then extended to multidimensional data tensors and integer data tensors. Furthermore, CMC is extended to be compatible with data containing missing values. Using CMC, we conducted four case studies on real-world bioinformatics research problems: (1) driving transcription factor (TF) identification; (2) scRNA-seq data normalization; (3) cancer-associated gene identification; (4) cell similarity quantification. For each of these case studies, we proposed a complete analysis framework and a specific adaptation of CMC.
For the driving-TF identification, compared with traditional methods, we considered the variations in each gene's binding affinity in addition to the typically considered variations in each TF's binding affinity. The driving TFs were identified by comparing the observed binding state with the estimated binding probability conditioned on the TF and gene binding affinities. For the scRNA-seq data normalization, besides the gene factor and the cell factor, we identified one more factor impacting the read counts, cDNA length, and applied CMC to analyze the three factors jointly. For cancer-associated gene identification, the CMC model is applied to systematically model the patient, gene, and mutation type factors in the mutation count data. As for the last application, to the best of our knowledge, our solution is the first proposed cell-to-cell-type similarity quantification method, thanks to the ability of CMC to systematically model and remove the impact of cell and gene factors. We studied the theoretical properties of the proposed model and validated the effectiveness and efficiency of our method through experiments. The uniqueness of the probability solution and the convergence of the algorithm were proved. In the endeavor to identify true driving TFs, CMC significantly improved on the best previously recorded success rate, as demonstrated on data with ground truth. In addition, in an exploratory study without ground truth, besides the previously known TFs Olig1 (ranked 2nd), Olig2 (ranked 3rd), and Sox10 (ranked 4th), we successfully identified Ppp1r14b (ranked 1st) and Zfp36l1 (ranked 6th), which function in oligodendrocyte lineage development; these findings were validated via biological knock-out experiments and have led to genuine biological discoveries. In the scRNA-seq data normalization, experimental results show that, by taking the cell, gene, and cDNA-length factors into account, the normalized data achieve lower variances for housekeeping genes than data normalized by the peer methods.
In addition, the data normalized by the CMC model lead to better accuracy in downstream DEG detection than data normalized by peer normalization methods. In cancer-associated gene identification, the CMC model is able to eliminate most of the likely artefactual findings that result from considering the hidden factors separately. In the cell similarity quantification, the CMC-based model enables the identification of cell types by establishing between-species cell similarity quantification, regardless of contamination in the scRNA-seq data.
- Cosbin: cosine score-based iterative normalization of biologically diverse samples. Wu, Chiung-Ting; Shen, Minjie; Du, Dongping; Cheng, Zuolin; Parker, Sarah J.; Lu, Yingzhou; Van Eyk, Jennifer E.; Yu, Guoqiang; Clarke, Robert; Herrington, David M.; Wang, Yue (Oxford University Press, 2022). Motivation: Data normalization is essential to ensure accurate inference and comparability of gene expression measures across samples or conditions. Ideally, gene expression data should be rescaled based on consistently expressed reference genes. However, to normalize biologically diverse samples, the most commonly used reference genes exhibit striking expression variability, and size-factor or distribution-based normalization methods can be problematic when the amount of asymmetry in differential expression is significant. Results: We report an efficient and accurate data-driven method, Cosine score-based iterative normalization (Cosbin), to normalize biologically diverse samples. Based on the Cosine scores of cross-condition expression patterns, the Cosbin pipeline iteratively eliminates asymmetric differentially expressed genes, identifies consistently expressed genes, and calculates sample-wise normalization factors. We demonstrate the superior performance and enhanced utility of Cosbin compared with six representative peer methods using both simulation and real multi-omics expression datasets. Implemented in open-source R scripts and specifically designed to address normalization bias due to significant asymmetry in differential expression across multiple conditions, the Cosbin tool complements rather than replaces the existing methods and will allow biologists to more accurately detect true molecular signals among diverse phenotypic groups. Availability and implementation: The R scripts of the Cosbin pipeline are freely available at https://github.com/MinjieSh/Cosbin. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
- COT: an efficient and accurate method for detecting marker genes among many subtypes. Lu, Yingzhou; Wu, Chiung-Ting; Parker, Sarah J.; Cheng, Zuolin; Saylor, Georgia; Van Eyk, Jennifer E.; Yu, Guoqiang; Clarke, Robert; Herrington, David M.; Wang, Yue (Oxford University Press, 2022). Motivation: Ideally, a molecularly distinct subtype would be composed of molecular features that are expressed uniquely in the subtype of interest but in no others, so-called marker genes (MGs). MGs play a critical role in the characterization, classification or deconvolution of tissue or cell subtypes. We and others have recognized that the test statistics used by most methods do not exactly satisfy the MG definition and often identify inaccurate MGs. Results: We report an efficient and accurate data-driven method, formulated as a Cosine-based One-sample Test (COT) in scatter space, to detect MGs among many subtypes using subtype expression profiles. Fundamentally different from existing approaches, the test statistic in COT precisely matches the mathematical definition of an ideal MG. We demonstrate the performance and utility of COT on both simulated and real gene expression and proteomics data. The open-source Python/R tool will allow biologists to efficiently detect MGs and perform a more comprehensive and unbiased molecular characterization of tissue or cell subtypes in many biomedical contexts. Nevertheless, COT complements rather than replaces existing methods. Availability and implementation: The Python COT software, with a detailed user's manual and a vignette, is freely available at https://github.com/MintaYLu/COT. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
- Paternal malnutrition programs breast cancer risk and tumor metabolism in offspring. da Cruz, Raquel S.; Carney, Elissa J.; Clarke, Johan; Cao, Hong; Cruz, M. Idalia; Benitez, Carlos; Jin, Lu; Fu, Yi; Cheng, Zuolin; Wang, Yue; de Assis, Sonia (2018-08-30). Background: While many studies have shown that maternal factors in pregnancy affect the cancer risk for offspring, few studies have investigated the impact of paternal exposures on their progeny’s risk of this disease. Population studies generally show a U-shaped association between birthweight and breast cancer risk, with both high and low birthweight increasing the risk compared with average birthweight. Here, we investigated whether paternal malnutrition would modulate the birthweight and later breast cancer risk of daughters. Methods: Male mice were fed AIN93G-based diets containing either 17.7% (control) or 8.9% (low-protein (LP)) energy from protein from 3 to 10 weeks of age. Males in either group were mated to females raised on a control diet. Female offspring from control and LP fathers were treated with 7,12-dimethylbenz[a]anthracene (DMBA) to initiate mammary carcinogenesis. Mature sperm from fathers and mammary tissue and tumors from female offspring were used for epigenetic and other molecular analyses. Results: We found that paternal malnutrition reduces the birthweight of daughters and leads to epigenetic and metabolic reprogramming of their mammary tissue and tumors. Daughters of LP fathers have higher rates of mammary cancer, with tumors arising earlier and growing faster than in controls. The energy sensor, the AMP-activated protein kinase (AMPK) pathway, is suppressed in both mammary glands and tumors of LP daughters, with consequent activation of mammalian target of rapamycin (mTOR) signaling. Furthermore, LP mammary tumors show altered amino-acid metabolism with increased glutamine utilization.
These changes are linked to alterations in noncoding RNAs regulating those pathways in mammary glands and tumors. Importantly, we detect alterations in some of the same microRNAs/target genes found in our animal model in breast tumors of women from populations where low birthweight is prevalent. Conclusions: Our study suggests that ancestral paternal malnutrition plays a role in programming offspring cancer risk and phenotype by likely providing a metabolic advantage to cancer cells.
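
As an illustration of the maximum entropy formulation described in the CMC abstract above, the following is a minimal sketch for the simplest case only: a 2D binary matrix with independent entries whose expected row and column sums are constrained to match the observed ones. Under these assumptions, the standard Lagrangian solution is p_ij = 1 / (1 + exp(-(a_i + b_j))), with one parameter per row and per column. This is not the dissertation's algorithm or code; the function names (`expected_probs`, `fit_maxent_binary`), the plain gradient ascent solver, the step size, and the toy matrix are all assumptions made for the example.

```python
import numpy as np


def expected_probs(a, b):
    """Entry-wise probabilities p_ij = sigmoid(a_i + b_j)."""
    return 1.0 / (1.0 + np.exp(-(a[:, None] + b[None, :])))


def fit_maxent_binary(X, n_iter=5000, tol=1e-10):
    """Fit row/column parameters so expected margins match observed margins.

    Hypothetical helper: gradient ascent on the concave dual of the
    maximum entropy problem. Assumes no row or column of X is all 0s or
    all 1s (otherwise the parameters diverge).
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    row_sums, col_sums = X.sum(axis=1), X.sum(axis=0)
    a = np.zeros(n_rows)  # row (e.g. gene) parameters
    b = np.zeros(n_cols)  # column (e.g. cell) parameters
    lr = 1.0 / max(n_rows, n_cols)  # conservative step size (assumption)
    for _ in range(n_iter):
        P = expected_probs(a, b)
        grad_a = row_sums - P.sum(axis=1)  # observed minus expected row sums
        grad_b = col_sums - P.sum(axis=0)  # observed minus expected column sums
        if max(np.abs(grad_a).max(), np.abs(grad_b).max()) < tol:
            break
        a += lr * grad_a
        b += lr * grad_b
    return expected_probs(a, b)


# Toy binary matrix (made-up data): rows and columns differ in how often they are 1.
X = np.array([
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
])
P = fit_maxent_binary(X)
print(np.round(P, 3))
```

Each entry of `P` can then be read as a null probability for the corresponding entry of `X`, conditioned on both the row and the column heterogeneity; comparing observed entries against `P` is the kind of downstream test the abstract alludes to, though the specific tests used in the dissertation are not shown here.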