Discovering Viral Hosts, Mutations, and Diseases using Machine Learning
| dc.contributor.author | Antony, Blessy | en |
| dc.contributor.committeechair | Murali, T. M. | en |
| dc.contributor.committeechair | Karpatne, Anuj | en |
| dc.contributor.committeemember | Bhattacharya, Debswapna | en |
| dc.contributor.committeemember | Radivojac, Predrag | en |
| dc.contributor.committeemember | Heath, Lenwood S. | en |
| dc.contributor.department | Computer Science & Applications | en |
| dc.date.accessioned | 2026-01-10T09:00:21Z | en |
| dc.date.available | 2026-01-10T09:00:21Z | en |
| dc.date.issued | 2026-01-09 | en |
| dc.description.abstract | The discovery of a novel virus raises three important questions, namely, which host(s) can the virus infect, what mutations in the virus could affect its interaction with its hosts and enable a host-shift, and which diseases can the virus cause in humans. We propose novel machine learning (ML)-based solutions to these three different problems in computational virology. (i) We develop a viral protein language model for predicting the host infected by a virus, given only the sequence of one of its proteins. Our approach, 'Hierarchical Attention for Viral protEin-based host iNference (HAVEN)', includes a novel architecture comprising segmentation and hierarchical self-attention to tackle the challenges posed by long sequences. Pretrained on 1.2 million viral protein sequences, the model accepts any protein sequence of any virus and predicts its host. We integrate HAVEN with a prototype-based few-shot learning (FSL) classifier to generalize it to predict rare and unseen hosts, and hosts of unseen viruses. (ii) Structured datasets of known viral mutations and their effects are required to develop computational models that can predict potential detrimental changes in novel animal viruses. We leverage large language models (LLMs) to create these datasets from the results of experimental studies available as unstructured text in scientific literature. We design an open-ended task for 'scientific information extraction (SIE)' from publications and propose a unique two-step retrieval-augmented generation (RAG) framework for the same. We curate a novel dataset of mutations in influenza A viral proteins. We use this dataset to benchmark our proposed approach, a wide range of LLMs, and RAG- and agent-based tools for SIE. (iii) Finally, we look at the effects of viral infections in humans. 
Specifically, we focus on the long-term effects of SARS-CoV-2 (or long COVID) wherein patients experience the persistence of COVID-19 symptoms for a long period of time after their initial SARS-CoV-2 infection. We propose an ML-based classification pipeline to predict the diagnosis of long COVID in COVID-19 patients using their electronic health records (EHRs) in the National COVID Cohort Collaborative, which is the largest collection of clinical data across the US. Using techniques to explain our models' prediction for each patient, we uncover many features that were correlated with long COVID. We also evaluate the impact of different data sources on our long COVID prediction models using a novel cross-site analysis. | en |
| dc.description.abstractgeneral | Viruses are one of the primary pathogens causing infectious diseases. There is a rise in the frequency of outbreaks of human infectious diseases across the globe. Several viruses originate in animals, evolve through mutations, and shift hosts to infect humans. It is important to detect the potential of animal viruses to infect humans in order to avoid, prepare for, and tackle future infectious disease epidemics through well-informed decisions. We propose novel artificial intelligence (AI)-based solutions to three important questions, namely, which host(s) can a virus infect, what mutations in the virus could affect its interaction with hosts, and which diseases can the virus cause in humans. We develop "Hierarchical Attention for Viral protEin-based host iNference (HAVEN)" based on the architecture of large language models (LLMs) such as ChatGPT. We train HAVEN to learn the properties of protein sequences of viruses and predict their hosts. HAVEN can also identify rare, unseen hosts and predict hosts of unseen viruses. Next, we focus on the mutations in a virus that allow it to shift from one host to another and infect humans. Results from experimental studies analyzing the effects of viral mutations on virus-host interaction are available primarily in the form of unstructured text in scientific publications. We seek to employ LLMs to retrieve this information from the scientific literature and create structured datasets from it. Retrieval-augmented generation (RAG) is a framework where an AI system first retrieves relevant information from a provided source and leverages it to generate accurate answers. We design a novel task for LLMs to perform 'scientific information extraction (SIE)' from publications and propose a unique two-step RAG framework for the same. We manually curate a novel dataset of mutations in influenza A viral proteins. 
We use this dataset to benchmark our proposed approach, a wide range of LLMs, and state-of-the-art RAG-based methods for SIE. Finally, we focus on the long-term effects of SARS-CoV-2. Long COVID is a disease condition wherein patients experience the persistence of COVID-19 symptoms for a long period of time after their initial COVID-19 infection. We trained prediction models using electronic health records (EHRs) of COVID-19 patients from their infection phase. We show that these machine learning models can effectively predict the future occurrence of long COVID, generalize to different sources of EHR data, and highlight informative indicators in EHRs for early diagnosis. The contributions in this thesis are aimed towards developing a coherent system for pandemic preparedness and prevention. | en |
| dc.description.degree | Doctor of Philosophy | en |
| dc.format.medium | ETD | en |
| dc.identifier.other | vt_gsexam:45633 | en |
| dc.identifier.uri | https://hdl.handle.net/10919/140731 | en |
| dc.language.iso | en | en |
| dc.publisher | Virginia Tech | en |
| dc.rights | In Copyright | en |
| dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
| dc.subject | infectious diseases | en |
| dc.subject | COVID-19 | en |
| dc.subject | machine learning | en |
| dc.subject | virus-host prediction | en |
| dc.subject | protein language models | en |
| dc.subject | generalizability | en |
| dc.subject | scientific information extraction | en |
| dc.subject | retrieval-augmented generation | en |
| dc.subject | long covid | en |
| dc.title | Discovering Viral Hosts, Mutations, and Diseases using Machine Learning | en |
| dc.type | Dissertation | en |
| thesis.degree.discipline | Computer Science & Applications | en |
| thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
| thesis.degree.level | doctoral | en |
| thesis.degree.name | Doctor of Philosophy | en |