Machine learning enabled bioinformatics tools for analysis of biologically diverse samples

Lu, Yingzhou2023-08-262023-08-262023-08-25vt_gsexam:38371http://hdl.handle.net/10919/116137Advanced molecular profiling technologies, utilizing the entire human genome, have opened new avenues to study biological systems. In recent decades, the generation of vast volumes of multi-omics data, spanning a broad range of phenotypes. Development of advanced bioinformatics tools to identify informative biomarkers from these data becomes increasingly important. These tools are crucial to extract meaningful biomarkers from this data, especially for understanding the biological pathways responsible for disease development. The identification of signature genes and the analysis of differentially networked genes are two fundamental and critically important tasks. However, many current methodologies employ test statistics that don't align perfectly with the signature definition, potentially leading to the identification of imprecise signatures. It may be challenging because the test statistics employed by many prevailing methods fall short of fulfilling the exact definition of a marker genes, inherently leaving them susceptible to deriving inaccurate features. The problem is further compounded when attempting to identify marker genes across biologically diverse samples, especially when comparing more than two biological conditions. Additionally, traditional differential group analysis or co-expression analysis under singular conditions often falls short in certain scenarios. For instance, the subtle expression levels of transcription factors (TFs) make their detection daunting, despite their pivotal role in guiding gene expression. Pinpointing the intricate network landscape of complex ailments and isolating core genes for subsequent analysis are challenging tasks. Yet, these marker genes are instrumental in identifing potential pivotal pathways. Multi-omics data, with its inherent complexity and diversity, presents unique challenges that traditional methods might struggle to address effectively. Recognizing this, our team sought to introduce new and innovative techniques specifically designed to handle this intricate dataset. To overcome these challenges, it is vital to develop and adopt innovative methods tailored to handle the complexity and diversity inherent in multi-omics data. In response to these challenges, we have pioneered the Cosine-based One-sample Test (COT), a method meticulously crafted for the analysis of biologically diverse samples. Tailored to discern marker genes across a spectrum of subtypes using their expression profiles, COT employs a one-sample test framework. The test statistic within COT utilizes cosine similarity, comparing a molecule's expression profile across various subtypes with the precise mathematical representation of ideal marker genes. To ensure ease of application and accessibility, we've encapsulated the COT workflow within a Python package. To assess its effectiveness, we undertook an exhaustive evaluation, juxtaposing the marker genes detection capabilities of COT against its contemporaries. This evaluation employed realistic simulation data. Our findings indicated that COT was not only adept at handling gene expression data but was also proficient with proteomics data. This data, sourced from enriched tissue or cell subtype samples, further accentuated COT's superior performance. We demonstrated the heightened effectiveness of COT when applied to gene expression and proteomics data originating from distinct tissue or cell subtypes. This led to innovative findings and hypotheses in several biomedical case studies. Additionally, we have enhanced the Differential Dependency Network (DDN) framework to detect network rewiring between different conditions where significantly rewired network modes serve as informative biomarkers. Using cross-condition data and a block-wise Lasso network model, DDN detects significant network rewiring together with a subnetwork of hub molecular entities. In DDN 3.0, we took the imbalanced sample size into the consideration, integrated several acceleration strategies to enable it to handle large datasets, and enhanced the network presentation for more informative network displays including color-coded differential dependency network and gradient heatmap. We applied it to the simulated data and real data to detect critical changes in molecular network topology. The current tool stands as a valuable blueprint for the development and validation of mechanistic disease models. This foundation aids in offering a coherent interpretation of data, deepening our understanding of disease biology, and sparking new hypotheses ripe for subsequent validation and exploration. As we chart our future course, our vision is to expand the scope of tools like COT and DDN 3.0, explore the vast realm of multi-omics data, including those from longitudinal studies or clinical trials. We're looking at incorporating datasets from longitudinal studies and clinical trials – domains where data complexity scales to new heights. We believe that these tools can facilitate more nuanced and comprehensive understanding of disease development and progression. Furthermore, by integrating these methods with other advanced bioinformatics and machine learning tools, we aim to create a holistic pipeline that will allow for seamless extraction of significant biomarkers and actionable insights from multi-omics data. This is a promising step towards precision medicine, where individual genomic information can guide personalized treatment strategies.ETDenCreative Commons Attribution-NonCommercial 4.0 Internationalmachine learningbiomarkerspathway analysisdifferential network analysismulti- omics integrationMachine learning enabled bioinformatics tools for analysis of biologically diverse samplesDissertation