Discovering Viral Hosts, Mutations, and Diseases using Machine Learning

Antony, Blessy

dc.contributor.author	Antony, Blessy	en
dc.contributor.committeechair	Murali, T. M.	en
dc.contributor.committeechair	Karpatne, Anuj	en
dc.contributor.committeemember	Bhattacharya, Debswapna	en
dc.contributor.committeemember	Radivojac, Predrag	en
dc.contributor.committeemember	Heath, Lenwood S.	en
dc.contributor.department	Computer Science & Applications	en
dc.date.accessioned	2026-01-10T09:00:21Z	en
dc.date.available	2026-01-10T09:00:21Z	en
dc.date.issued	2026-01-09	en
dc.description.abstract	The discovery of a novel virus raises three important questions: which host(s) can the virus infect, what mutations in the virus could affect its interaction with its hosts and enable a host shift, and which diseases can the virus cause in humans. We propose novel machine learning (ML)-based solutions to these three problems in computational virology. (i) We develop a viral protein language model for predicting the host infected by a virus, given only the sequence of one of its proteins. Our approach, 'Hierarchical Attention for Viral protEin-based host iNference (HAVEN)', includes a novel architecture comprising segmentation and hierarchical self-attention to tackle the challenges posed by long sequences. Pretrained on 1.2 million viral protein sequences, the model accepts any protein sequence of any virus and predicts its host. We integrate HAVEN with a prototype-based few-shot learning (FSL) classifier to generalize it to predict rare and unseen hosts, and hosts of unseen viruses. (ii) Structured datasets of known viral mutations and their effects are required to develop computational models that can predict potential detrimental changes in novel animal viruses. We leverage large language models (LLMs) to create these datasets from the results of experimental studies available as unstructured text in scientific literature. We design an open-ended task for 'scientific information extraction (SIE)' from publications and propose a unique two-step retrieval-augmented generation (RAG) framework for it. We curate a novel dataset of mutations in influenza A viral proteins. We use this dataset to benchmark our proposed approach, a wide range of LLMs, and RAG- and agent-based tools for SIE. (iii) Finally, we look at the effects of viral infections in humans. 
Specifically, we focus on the long-term effects of SARS-CoV-2 (or long COVID), wherein patients experience the persistence of COVID-19 symptoms for a long period of time after their initial SARS-CoV-2 infection. We propose an ML-based classification pipeline to predict the diagnosis of long COVID in COVID-19 patients using their electronic health records (EHRs) in the National COVID Cohort Collaborative, which is the largest collection of clinical data across the US. Using techniques to explain our models' predictions for each patient, we uncover many features that are correlated with long COVID. We also evaluate the impact of different data sources on our long COVID prediction models using a novel cross-site analysis.	en
dc.description.abstractgeneral	Viruses are one of the primary pathogens causing infectious diseases. There is a rise in the frequency of outbreaks of human infectious diseases across the globe. Several viruses originate in animals, evolve through mutations, and shift hosts to infect humans. It is important to detect the potential of animal viruses to infect humans in order to avoid, prepare for, and tackle future infectious disease epidemics through well-informed decisions. We propose novel artificial intelligence (AI)-based solutions to three important questions, namely, which host(s) can a virus infect, what mutations in the virus could affect its interaction with hosts, and which diseases can the virus cause in humans. We develop "Hierarchical Attention for Viral protEin-based host iNference (HAVEN)" based on the architecture of large language models (LLMs) such as ChatGPT. We train HAVEN to learn the properties of protein sequences of viruses and predict their hosts. HAVEN can also identify rare, unseen hosts and predict hosts of unseen viruses. Next, we focus on the mutations in a virus that allow it to shift from one host to another and infect humans. Results from experimental studies analyzing the effects of viral mutations on virus-host interaction are available primarily in the form of unstructured text in scientific publications. We seek to employ LLMs to retrieve this information from the scientific literature and create structured datasets. Retrieval-augmented generation (RAG) is a framework where an AI system first retrieves relevant information from a provided source and leverages it to generate accurate answers. We design a novel task for LLMs to perform 'scientific information extraction (SIE)' from publications and propose a unique two-step RAG framework for it. We manually curate a novel dataset of mutations in influenza A viral proteins. 
We use this dataset to benchmark our proposed approach, a wide range of LLMs, and state-of-the-art RAG-based methods for SIE. Finally, we focus on the long-term effects of SARS-CoV-2. Long COVID is a disease condition wherein patients experience the persistence of COVID-19 symptoms for a long period of time after their initial COVID-19 infection. We trained prediction models using electronic health records (EHRs) of COVID-19 patients from their infection phase. We show that these machine learning models can effectively predict the future occurrence of long COVID, generalize to different sources of EHR data, and highlight informative indicators in EHRs for early diagnosis. The contributions in this thesis are aimed towards developing a coherent system for pandemic preparedness and prevention.	en
dc.description.degree	Doctor of Philosophy	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:45633	en
dc.identifier.uri	https://hdl.handle.net/10919/140731	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	infectious diseases	en
dc.subject	COVID-19	en
dc.subject	machine learning	en
dc.subject	virus-host prediction	en
dc.subject	protein language models	en
dc.subject	generalizability	en
dc.subject	scientific information extraction	en
dc.subject	retrieval-augmented generation	en
dc.subject	long covid	en
dc.title	Discovering Viral Hosts, Mutations, and Diseases using Machine Learning	en
dc.type	Dissertation	en
thesis.degree.discipline	Computer Science & Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Doctor of Philosophy	en

Files

Original bundle

Name: Antony_B_D_2026.pdf
Size: 23.06 MB
Format: Adobe Portable Document Format

Collections

Doctoral Dissertations