Discovering Viral Hosts, Mutations, and Diseases using Machine Learning

Antony, Blessy

Files

Antony_B_D_2026.pdf (23.06 MB)

Date

2026-01-09

Authors

Antony, Blessy

Publisher

Virginia Tech

Abstract

The discovery of a novel virus raises three important questions, namely, which host(s) can the virus infect, what mutations in the virus could affect its interaction with its hosts and enable a host-shift, and which diseases can the virus cause in humans. We propose novel machine learning (ML)-based solutions to these three different problems in computational virology.

(i) We develop a viral protein language model for predicting the host infected by a virus, given only the sequence of one of its proteins. Our approach, 'Hierarchical Attention for Viral protEin-based host iNference (HAVEN)', includes a novel architecture comprising segmentation and hierarchical self-attention to tackle the challenges posed by long sequences. Pretrained on 1.2 million viral protein sequences, the model accepts any protein sequence of any virus and predicts its host. We integrate HAVEN with a prototype-based few-shot learning (FSL) classifier to generalize it to predict rare and unseen hosts, and hosts of unseen viruses.
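As a rough illustration of the prototype-based few-shot idea mentioned above, the sketch below averages the embeddings of a few labeled examples per host into a prototype and assigns a query to the nearest prototype. It is not HAVEN's implementation: the 2-D embeddings, host names, function names, and the choice of Euclidean distance are all assumptions made for illustration. In practice the embeddings would come from the pretrained protein language model.

```python
import math


def build_prototypes(support):
    """Average the embeddings of each class to get one prototype per class.

    `support` maps a class label to a list of equal-length embedding vectors.
    """
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return prototypes


def classify(query, prototypes):
    """Return the label whose prototype is closest to `query` (Euclidean)."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-D embeddings; real ones would come from a protein language model.
support = {
    "human": [[0.9, 0.1], [1.1, -0.1]],
    "chicken": [[-1.0, 0.8], [-0.8, 1.2]],
}
prototypes = build_prototypes(support)
predicted_host = classify([0.7, 0.2], prototypes)
```

Because a new host only needs a few support examples to form a prototype, a classifier of this kind can be extended to rare or previously unseen classes without retraining the embedding model.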

(ii) Structured datasets of known viral mutations and their effects are required to develop computational models that can predict potentially detrimental changes in novel animal viruses. We leverage large language models (LLMs) to create these datasets from the results of experimental studies available as unstructured text in the scientific literature. We design an open-ended task for 'scientific information extraction (SIE)' from publications and propose a unique two-step retrieval-augmented generation (RAG) framework for it. We curate a novel dataset of mutations in influenza A viral proteins and use it to benchmark our proposed approach alongside a wide range of LLMs and RAG- and agent-based tools for SIE.
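To make the notion of a structured mutation record concrete, the sketch below pulls amino-acid substitutions written in the common single-letter notation (wild-type residue, position, mutant residue, as in E627K) out of a sentence. This only illustrates a possible target output format. It is not the two-step RAG framework itself, which relies on LLMs. The regular expression, the `Substitution` record, and the example sentence are assumptions.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    """One amino-acid substitution, e.g. E627K -> ('E', 627, 'K')."""
    wild_type: str
    position: int
    mutant: str


# The 20 standard amino acids in single-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PATTERN = re.compile(rf"\b([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])\b")


def extract_substitutions(text):
    """Return every substitution mention found in `text`, in order."""
    return [Substitution(w, int(p), m) for w, p, m in PATTERN.findall(text)]


# Made-up example sentence, used only to exercise the pattern.
sentence = "Substitutions E627K and D701N in PB2 were tested."
records = extract_substitutions(sentence)
```

A pattern like this captures only the mutation itself. Linking each mutation to its reported effect is the open-ended part of the task, which is where an LLM-based approach comes in.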

(iii) Finally, we look at the effects of viral infections in humans. Specifically, we focus on the long-term effects of SARS-CoV-2 infection (long COVID), wherein patients' COVID-19 symptoms persist long after their initial SARS-CoV-2 infection. We propose an ML-based classification pipeline to predict the diagnosis of long COVID in COVID-19 patients using their electronic health records (EHRs) in the National COVID Cohort Collaborative, which is the largest collection of clinical data across the US. Using techniques to explain our models' predictions for each patient, we uncover many features that are correlated with long COVID. We also evaluate the impact of different data sources on our long COVID prediction models using a novel cross-site analysis.
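One common way to set up a cross-site evaluation is to hold out all records from one contributing site, train on the rest, and repeat for each site. The abstract does not spell out the dissertation's protocol, so the sketch below is an assumption about the general idea only. The record fields (`site`, `features`, `label`) and the toy data are hypothetical, and no model is trained here.

```python
from collections import defaultdict


def leave_one_site_out(records):
    """Yield (held_out_site, train, test) with one site held out per fold."""
    by_site = defaultdict(list)
    for record in records:
        by_site[record["site"]].append(record)
    for site in sorted(by_site):
        test = by_site[site]
        # Training data is every record that does not come from the held-out site.
        train = [r for s, rs in by_site.items() if s != site for r in rs]
        yield site, train, test


# Hypothetical toy records standing in for per-patient EHR features.
records = [
    {"site": "A", "features": [1, 0], "label": 1},
    {"site": "A", "features": [0, 1], "label": 0},
    {"site": "B", "features": [1, 1], "label": 1},
    {"site": "C", "features": [0, 0], "label": 0},
]
folds = list(leave_one_site_out(records))
```

Comparing performance on held-out sites against performance on sites seen during training gives a sense of how much a model depends on site-specific data practices.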

Keywords

infectious diseases, COVID-19, machine learning, virus-host prediction, protein language models, generalizability, scientific information extraction, retrieval-augmented generation, long COVID

Persistent link

https://hdl.handle.net/10919/140731

Collections

Doctoral Dissertations
