Towards Network-Guided Large-Scale Foundation Models on Single-Cell Transcriptomics
Abstract
Large-scale pretrained models, known as foundation models, have made breakthrough progress in fields such as NLP and computer vision. Recently, transformer-based foundation models tailored for single-cell RNA sequencing (scRNA-seq) data have shown significant potential in interpreting the 'languages' of cells through self-supervised learning on large amounts of unlabeled scRNA-seq data. These models could significantly enhance our understanding of cellular functions and disease mechanisms. However, unlike text data, scRNA-seq data is high-dimensional, inherently noisy, and sparse, which poses unique challenges. We hypothesize that a major limitation of current single-cell foundation models (scFMs) lies in their inability to effectively leverage prior biological knowledge that could provide complementary insights into relationships between genes. One of the most critical applications of scRNA-seq is the inference of gene regulatory networks (GRNs), which represent the intricate interactions between transcription factors (TFs) and their target genes. In the first part of this thesis, we propose SCREGNET, a framework for the GRN inference task that combines scFMs with graph-based learning by incorporating experimentally validated transcription factor-DNA binding data in the form of networks of known regulatory interactions. SCREGNET achieved state-of-the-art results in the gene regulatory link prediction task when compared to nine baseline methods across seven scRNA-seq benchmark datasets and demonstrated greater robustness. In the second part of the thesis, we systematically explored incorporating prior GRNs into the pretraining of scFMs. This exploration provided insights into the benefits and limitations of network guidance, revealing varied effects on predictive accuracy across different downstream tasks related to chromatin and network dynamics.