Applications of Machine Learning in Source Attribution and Gene Function Prediction

Chinnareddy, Sandeep

dc.contributor.author	Chinnareddy, Sandeep	en
dc.contributor.committeechair	Li, Song	en
dc.contributor.committeechair	Liao, Jingqiu	en
dc.contributor.committeemember	Wang, Xuan	en
dc.contributor.committeemember	Zhang, Liqing	en
dc.contributor.department	Computer Science and Applications	en
dc.date.accessioned	2024-06-08T08:00:42Z	en
dc.date.available	2024-06-08T08:00:42Z	en
dc.date.issued	2024-06-07	en
dc.description.abstract	This research investigates the application of machine learning techniques in computational genomics across two distinct domains: (1) predicting the source of bacterial pathogens using whole genome sequencing data, and (2) the functional annotation of genes using single-cell RNA sequencing data. This work proposes the development of a bioinformatics pipeline tailored for identifying genomic variants, including gene presence/absence and single nucleotide polymorphisms. This methodology is applied to Salmonella enterica serovar Typhimurium and the Ralstonia solanacearum species complex. Phylogenetic analyses, along with pan-genome and positive selection studies, show that genomic variants and evolutionary patterns of S. Typhimurium vary across sources, which suggests that sources can be accurately attributed from genomic variants using machine learning. We benchmarked seven traditional machine learning algorithms, achieving an accuracy of 94.6% in host prediction for S. Typhimurium with the Random Forest model, supported by SHAP value analyses that elucidated key predictive features. Next, the focus shifts to the prediction of Gene Ontology terms for Arabidopsis genes using single-cell RNA-seq data. This analysis offers a detailed comparison of gene expression in root versus shoot tissues, juxtaposed with insights from bulk RNA-seq data. The integration of regulatory network data from DAP-seq significantly enhances the prediction accuracy of gene functions.	en
dc.description.abstractgeneral	This work applies machine learning techniques to two areas in computational biology: predicting the hosts of bacterial pathogens based on their genome data, and predicting the functions of plant genes using single-cell gene expression data. The first part develops a method to analyze genome sequences from bacterial pathogens such as Salmonella enterica serovar Typhimurium and the Ralstonia solanacearum species complex, identifying genomic variants, including gene presence/absence and single nucleotide polymorphisms, which are variations in the genetic code. Studying the evolutionary relationships and genetic diversity among different strains establishes the motivation for using machine learning models to predict the sources (e.g., poultry, swine) of the pathogen genomes. Several machine learning models are then trained on these datasets, and the most important factors contributing to the predictions are identified. The second part focuses on predicting the functions of genes in the model plant species Arabidopsis thaliana. Gene expression data measured at the single-cell level are used to train machine learning models that identify standardized gene function descriptions called Gene Ontology (GO) terms. By comparing results from single-cell and bulk tissue data, the study evaluates whether the higher resolution of single-cell data improves gene function prediction accuracy. Additionally, by incorporating information about gene regulation from a specialized experiment, the role of gene expression control in determining gene functions is explored.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:40646	en
dc.identifier.uri	https://hdl.handle.net/10919/119357	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Machine Learning	en
dc.subject	Source Attribution	en
dc.subject	Whole genome sequencing	en
dc.subject	Gene function prediction	en
dc.subject	single-cell RNAseq	en
dc.title	Applications of Machine Learning in Source Attribution and Gene Function Prediction	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Science & Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en

Files

Original bundle

Name: Chinnareddy_S_T_2024.pdf
Size: 6.06 MB
Format: Adobe Portable Document Format

Collections

Masters Theses