Cheng, Zuolin
2023-01-18
2023-01-18
2023-01-17
vt_gsexam:36442
http://hdl.handle.net/10919/113211
In biology and bioinformatics, many kinds of data share a property that challenges cutting-edge research: heterogeneity at the individual level with respect to more than one factor. Examples include, but are not limited to, (1) unequal susceptibility of different patients and (2) large diversity in gene length, GC content, and similar properties, along with the resulting gene characteristics. For many biological data analyses, the critical first step is to infer the null probability distribution of the observed data with the heterogeneities in multiple (confounding) factors taken into account, so that the impact of other factors of interest can then be investigated. These heterogeneities heavily influence the conclusions that can be drawn from statistical analyses of the data. However, modeling them has been challenging, not only because explicitly modeling every factor with heterogeneous effects on the data is infeasible, but also because many factors are not independent of one another. Existing methods either partially or fully neglect the heterogeneity issue or handle each factor's heterogeneity in isolation. Evidence has shown that such strategies are insufficient and can produce errors in downstream analyses. The emergence of large-scale data sets provides the opportunity to learn the heterogeneity directly and comprehensively from the data, without explicitly modeling the underlying mechanisms or imposing strong assumptions. The data, often stored or organized as multidimensional contingency tensors, lead to a natural perspective for modeling heterogeneity in which each impact factor of interest is one dimension.
The heterogeneity in each factor's impact on the variable of interest can be captured by the marginal property of the data tensor with respect to the corresponding dimension. For instance, in a single-cell sequencing dataset organized as a matrix in which each row represents a gene and each column represents a cell, the heterogeneity caused by both the gene and cell factors can be modeled. In this dissertation, we develop a novel model, Conditional Multifactorial Contingency (CMC), that models the intertwined heterogeneities in all dimensions of the data tensor and infers the probability distribution of each entry of the data tensor jointly conditioned on these heterogeneities. In the proposed CMC model, the problem is formulated as maximizing the entropy of the contingency tensor's probability distribution subject to the marginal constraints, under the assumption that the individuals within each dimension are independent. The marginal constraints are applied to the expected values rather than the observed trial outcomes, which plays a key role in avoiding the innumerable combinations of trial outcomes and leads to an elegant expression for each entry's probability distribution. The model is first developed for a 3D binary data matrix, then extended to multidimensional data tensors and integer data tensors. Furthermore, CMC is extended to handle data with missing values. Empowered by CMC, we conducted four case studies on real-world bioinformatics research problems: (1) driving transcription factor (TF) identification; (2) scRNA-seq data normalization; (3) cancer-associated gene identification; and (4) cell similarity quantification. For each case study, we proposed a complete analysis framework and a specific adaptation of CMC.
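To make the maximum entropy formulation concrete, here is a minimal sketch for the simplest two-factor case: a binary matrix such as the gene-by-cell example. It is not the dissertation's algorithm, whose exact expression and solver are not given in this abstract. It relies on the standard result that maximizing the entropy of independent Bernoulli entries, subject to expected row and column sums equal to the observed ones, gives probabilities of the form p_ij = sigmoid(a_i + b_j). The function names, the alternating bisection solver, and the toy matrix are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def solve_offset(target, others, lo=-30.0, hi=30.0, steps=60):
    """Bisection for the offset t with sum_k sigmoid(t + others[k]) == target.

    The sum is strictly increasing in t, so bisection is safe.
    """
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if sum(sigmoid(mid + o) for o in others) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fit_maxent_binary(matrix, sweeps=500, tol=1e-10):
    """Entry probabilities p_ij = sigmoid(a_i + b_j) whose expected row and
    column sums match the observed margins (hypothetical helper, 2D only)."""
    n_cols = len(matrix[0])
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(row[j] for row in matrix) for j in range(n_cols)]
    a = [0.0] * len(matrix)  # one offset per row (e.g. gene factor)
    b = [0.0] * n_cols       # one offset per column (e.g. cell factor)
    probs = []
    for _ in range(sweeps):
        # Alternate exact updates: rows given columns, then columns given rows.
        a = [solve_offset(r, b) for r in row_sums]
        b = [solve_offset(c, a) for c in col_sums]
        probs = [[sigmoid(ai + bj) for bj in b] for ai in a]
        row_err = max(abs(sum(p) - r) for p, r in zip(probs, row_sums))
        if row_err < tol:
            break
    return probs


# Toy binary matrix (illustrative data, not from the dissertation).
observed = [
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
]
probs = fit_maxent_binary(observed)
```

Each entry of `probs` is a null probability conditioned jointly on its row and column margins. An observed 1 with a low probability, or an observed 0 with a high one, is then a candidate for the kind of follow-up comparison the case studies describe. The sketch assumes every margin is strictly between zero and its maximum; all-zero or all-one rows and columns push the offsets to the bisection bounds and would need separate handling.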
For driving-TF identification, in contrast to traditional methods, we considered the variation in each gene's binding affinity in addition to the typically considered variation in each TF's binding affinity. The driving TFs were identified by comparing the observed binding state with the estimated binding probability conditioned on the TF and gene binding affinities. For scRNA-seq data normalization, besides the gene and cell factors, we identified one more factor that affects the read counts, cDNA length, and applied CMC to analyze the three factors jointly. For cancer-associated gene identification, the CMC model is applied to systematically model the patient, gene, and mutation type factors in the mutation count data. As for the last application, to the best of our knowledge, our solution is the first proposed method for cell-to-cell-type similarity quantification, made possible by using CMC to systematically model and remove the impact of cell and gene factors. We studied the theoretical properties of the proposed model and validated the effectiveness and efficiency of our method through experiments. The uniqueness of the probability solution and the convergence of the algorithm were proved. In identifying true driving TFs, CMC significantly improved on the best previously reported success rate, as demonstrated on data with ground truth. In addition, in an exploratory study without ground truth, besides the previously known TFs Olig1 (ranked 2nd), Olig2 (ranked 3rd), and Sox10 (ranked 4th), we identified Ppp1r14b (ranked 1st) and Zfp36l1 (ranked 6th), which function in oligodendrocyte lineage development; this finding was validated via biological knock-out experiments and has led to genuine biological discoveries. In scRNA-seq data normalization, experimental results show that, by taking the cell, gene, and cDNA-length factors into account, the normalized data achieve lower variance for housekeeping genes than data normalized by peer methods.
In addition, data normalized by the CMC model lead to more accurate downstream DEG detection than data normalized by peer normalization methods. In cancer-associated gene identification, the CMC model eliminates most of the likely artefactual findings that result from considering the hidden factors separately. In cell similarity quantification, the CMC-based model enables the identification of cell types by establishing between-species cell similarity quantification, regardless of contamination in scRNA-seq data.
ETD
en
In Copyright
Conditional Multifactorial Contingency Model
tensor
heterogeneity
multiple factors
Conditional Multifactorial Contingency (CMC) Model and Its Applications
Dissertation