Discovering Viral Hosts, Mutations, and Diseases using Machine Learning
Abstract
The discovery of a novel virus raises three important questions, namely, which host(s) can the virus infect, what mutations in the virus could affect its interaction with its hosts and enable a host-shift, and which diseases can the virus cause in humans. We propose novel machine learning (ML)-based solutions to these three different problems in computational virology.
(i) We develop a viral protein language model for predicting the host infected by a virus, given only the sequence of one of its proteins. Our approach, 'Hierarchical Attention for Viral protEin-based host iNference (HAVEN)', includes a novel architecture comprising segmentation and hierarchical self-attention to tackle the challenges posed by long sequences. Pretrained on 1.2 million viral protein sequences, the model accepts any protein sequence of any virus and predicts its host. We integrate HAVEN with a prototype-based few-shot learning (FSL) classifier to generalize it to predict rare and unseen hosts, and hosts of unseen viruses.
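To illustrate the prototype-based few-shot idea mentioned above, here is a minimal sketch, not HAVEN's actual implementation. It assumes each protein has already been mapped to a fixed-length embedding (the toy 2-D vectors below are invented for illustration), builds one prototype per host as the mean of its few labeled embeddings, and assigns a query to the nearest prototype. The function names, the Euclidean distance, and the host labels are all assumptions.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_prototypes(support):
    """Map each host label to the mean of its support embeddings."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_host(query, prototypes):
    """Return the host whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: euclidean(query, prototypes[label]))


# Hypothetical support set: two labeled embeddings per host (toy values).
support = {
    "human": [[0.9, 0.1], [1.1, -0.1]],
    "swine": [[-1.0, 0.8], [-0.8, 1.2]],
}
prototypes = build_prototypes(support)
prediction = predict_host([0.7, 0.2], prototypes)
print(prediction)
```

Because a new host needs only a handful of labeled embeddings to form a prototype, this style of classifier can be extended to rare or previously unseen hosts without retraining the underlying encoder.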
(ii) Structured datasets of known viral mutations and their effects are required to develop computational models that can predict potentially detrimental changes in novel animal viruses. We leverage large language models (LLMs) to create these datasets from the results of experimental studies available as unstructured text in the scientific literature. We design an open-ended task for 'scientific information extraction (SIE)' from publications and propose a unique two-step retrieval-augmented generation (RAG) framework for it. We curate a novel dataset of mutations in influenza A viral proteins and use it to benchmark our proposed approach against a wide range of LLMs and RAG- and agent-based tools for SIE.
(iii) Finally, we look at the effects of viral infections in humans. Specifically, we focus on the long-term effects of SARS-CoV-2 (long COVID), wherein patients experience COVID-19 symptoms that persist long after their initial SARS-CoV-2 infection. We propose an ML-based classification pipeline to predict the diagnosis of long COVID in COVID-19 patients using their electronic health records (EHRs) in the National COVID Cohort Collaborative, which is the largest collection of clinical data across the US. Using techniques to explain our models' predictions for each patient, we uncover many features that were correlated with long COVID. We also evaluate the impact of different data sources on our long COVID prediction models using a novel cross-site analysis.
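The abstract does not describe how the cross-site analysis is set up. One common way to probe how data sources affect a model is a leave-one-site-out split, sketched below under that assumption. The record fields (`site`, `patient`, `long_covid`) and the site names are hypothetical, and no real patient data is used.

```python
from collections import defaultdict


def leave_one_site_out(records):
    """Yield (held_out_site, train, test) with one site held out per split."""
    by_site = defaultdict(list)
    for rec in records:
        by_site[rec["site"]].append(rec)
    for held_out in sorted(by_site):
        # Train on every record from the other sites.
        train = [r for site, recs in by_site.items() if site != held_out for r in recs]
        # Test only on records from the held-out site.
        test = list(by_site[held_out])
        yield held_out, train, test


# Hypothetical toy records from three contributing sites.
records = [
    {"site": "A", "patient": 1, "long_covid": 0},
    {"site": "A", "patient": 2, "long_covid": 1},
    {"site": "B", "patient": 3, "long_covid": 0},
    {"site": "C", "patient": 4, "long_covid": 1},
    {"site": "C", "patient": 5, "long_covid": 0},
]
splits = list(leave_one_site_out(records))
for site, train, test in splits:
    print(site, len(train), len(test))
```

Comparing a model's performance on each held-out site with its performance on the sites it was trained on gives a rough indication of how well predictions transfer across data sources.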