Applications of Machine Learning in Source Attribution and Gene Function Prediction

TR Number



Journal Title

Journal ISSN

Volume Title


Virginia Tech


This research investigates the application of machine learning techniques in computational genomics across two distinct domains: (1) the predicting the source of bacterial pathogen using whole genome sequencing data, and (2) the functional annotation of genes using single- cell RNA sequencing data. This work proposes the development of a bioinformatics pipeline tailored for identifying genomic variants, including gene presence/absence and single nu- cleotide polymorphism. This methodology is applied to specific strains such as Salmonella enterica serovar Typhimurium and the Ralstonia solanacearum species complex. Phylo- genetic analyses along with pan-genome and positive selection studiesshow that genomic variants and evolutionary patterns of S. Typhimurium vary across sources, which suggests that sources can be accurately attributed based on genomic variants empowered by machine learning. We benchmarked seven traditional machine learning algorithms, achieving a no- table accuracy of 94.6% in host prediction for S. Typhimurium using the Random Forest model, underscored by SHAP value analyses which elucidated key predictive features. Next, the focus is shifted to the prediction of Gene Ontology terms for Arabidopsis genes using single-cell RNA-seq data. This analysis offers a detailed comparison of gene expression in root versus shoot tissues, juxtaposed with insights from bulk RNA-seq data. The integration of regulatory network data from DAP-seq significantly enhances the prediction accuracy of gene functions.



Machine Learning, Source Attribution, Whole genome sequencing, Gene function prediction, single-cell RNAseq