An RNA Virus Reference Database (RVRD) to Enhance Virus Detection in Metagenomic Data
Metagenomics holds great promise for exploring virome composition and discovering novel virus species, which creates a pressing demand for comprehensive, up-to-date reference databases to support downstream bioinformatics analysis. In this study, an RNA virus reference database (RVRD) was developed by manual and computational curation of RNA virus genomes downloaded from three major virus sequence databases: NCBI, ViralZone, and ViPR. To reduce the redundancy caused by multiple identical or nearly identical sequences, the sequences were first clustered, and within each cluster of sequences sharing more than 98% identity with one another, all but one sequence were removed. Other identity cutoffs were also examined, and Hepatitis C virus genomes were studied in detail as an example. Using the 98% identity cutoff, the sequences obtained from ViPR were combined with the unique RNA virus references from NCBI and ViralZone to generate the final RVRD. The resulting RVRD contained 23,085 sequences, nearly 5 times the size of the NCBI RNA virus reference, and covered a broad range of RNA virus families, with significant expansion in circular ssRNA virus and pathogenic virus families. In a performance evaluation against the NCBI RNA virus reference, using RVRD as the reference database identified more RNA virus species in RNA-seq data derived from wastewater samples. Moreover, using RVRD as the reference database led to the identification of porcine rotavirus as the etiological agent of unexplained diarrhea observed in pigs. RVRD is publicly available for enhancing RNA virus metagenomics.
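The redundancy-reduction step described above can be illustrated with a minimal sketch. The abstract does not name the clustering tool or the exact identity measure, so everything here is an assumption made for illustration: the function names `ungapped_identity` and `pick_representatives` are hypothetical, the identity score is a toy position-by-position comparison rather than an alignment-based one, and the greedy longest-first strategy is one common way to pick a single representative per cluster, not necessarily the procedure used to build RVRD.

```python
from typing import Callable, Dict, List


def ungapped_identity(a: str, b: str) -> float:
    """Toy identity: matching positions divided by the longer length.

    Assumption: a stand-in for a real alignment-based identity score.
    """
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def pick_representatives(
    seqs: Dict[str, str],
    cutoff: float = 0.98,
    identity: Callable[[str, str], float] = ungapped_identity,
) -> List[str]:
    """Greedy clustering sketch (hypothetical helper).

    A sequence is kept only if its identity to every representative
    kept so far is at or below `cutoff`; otherwise it is treated as
    redundant and dropped.
    """
    reps: List[str] = []
    # Longest first (ties broken by name), so the longest member
    # becomes the representative of its cluster.
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        if all(identity(seqs[name], seqs[r]) <= cutoff for r in reps):
            reps.append(name)
    return reps
```

Called on three 100-nt toy sequences where `b` differs from `a` at one position and `c` differs at ten, `pick_representatives` with the default 0.98 cutoff keeps `a` and `c` and drops `b`. Lowering the cutoff collapses more sequences into each cluster, which is the trade-off explored when other identity cutoffs are compared.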