A RNA Virus Reference Database (RVRD) to Enhance Virus Detection in Metagenomic Data

dc.contributor.author: Lei, Shaohua
dc.contributor.committeechair: Zhang, Liqing
dc.contributor.committeemember: Cao, Yang
dc.contributor.committeemember: Wu, Xiaowei
dc.contributor.department: Computer Science
dc.date.accessioned: 2018-10-17T08:00:23Z
dc.date.available: 2018-10-17T08:00:23Z
dc.date.issued: 2018-10-16
dc.description.abstract: Metagenomics holds great promise for exploring virome composition and discovering novel virus species, which creates a pressing demand for comprehensive, up-to-date reference databases to support downstream bioinformatics analysis. In this study, an RNA virus reference database (RVRD) was developed by manual and computational curation of RNA virus genomes downloaded from three major virus sequence databases: NCBI, ViralZone, and ViPR. To reduce redundancy caused by identical or nearly identical sequences, sequences were first clustered, and within each cluster of sequences sharing more than 98% identity, all but one sequence were removed. Other identity cutoffs were also examined, and Hepatitis C virus genomes were studied in detail as an example. Using the 98% identity cutoff, sequences obtained from ViPR were combined with the unique RNA virus references from NCBI and ViralZone to generate the final RVRD. The resulting RVRD contained 23,085 sequences, nearly 5 times the size of the NCBI RNA virus reference, and covered a broad range of RNA virus families, with significant expansion in circular ssRNA virus and pathogenic virus families. In performance evaluation, using RVRD as the reference database identified more RNA virus species than the NCBI RNA virus reference in RNAseq data derived from wastewater samples. Moreover, using RVRD as the reference database led to the identification of porcine rotavirus as the etiology of unexplained diarrhea observed in pigs. RVRD is publicly available for enhancing RNA virus metagenomics.
dc.description.abstractgeneral: Next-generation sequencing technology has demonstrated its capability to detect viruses in various samples, but one challenge in bioinformatics analysis is the lack of well-curated reference databases, especially for RNA viruses. In this study, an RNA virus reference database (RVRD) was developed by manual and computational curation of three commonly used resources: NCBI, ViralZone, and ViPR. RVRD was designed to be comprehensive, with broad coverage of RNA virus families, while clustering was performed to reduce redundant sequences. The performance of RVRD was compared with that of the NCBI RNA virus reference database using FastViromeExplorer, a pipeline recently developed by our lab. More RNA viruses were identified in several metagenomic datasets using RVRD, indicating improved performance in practice.
dc.description.degree: Master of Science
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:17273
dc.identifier.uri: http://hdl.handle.net/10919/85388
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: RNA virus
dc.subject: Database
dc.subject: Virus detection
dc.subject: Metagenomics
dc.subject: Cluster
dc.title: A RNA Virus Reference Database (RVRD) to Enhance Virus Detection in Metagenomic Data
dc.type: Thesis
thesis.degree.discipline: Computer Science and Applications
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: masters
thesis.degree.name: Master of Science
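
The redundancy-reduction step summarized in the abstract (cluster sequences, then keep one representative per group sharing more than 98% identity) can be illustrated with a rough sketch. This record does not name the clustering tool or the identity measure used in the thesis, so everything below is an assumption for illustration only: the function names `identity` and `deduplicate` are hypothetical, the greedy longest-first strategy is one common way to pick representatives, and the identity score is a crude proxy built on Python's standard `difflib`, not an alignment-based identity as dedicated clustering tools would compute.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude identity proxy (an assumption, not the thesis's measure):
    characters in matching blocks divided by the shorter sequence length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def deduplicate(records: dict[str, str], cutoff: float = 0.98) -> dict[str, str]:
    """Greedy clustering sketch: visit sequences longest first and keep one
    only if it is not more than `cutoff` identical to any kept representative."""
    representatives: dict[str, str] = {}
    ordered = sorted(records.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in ordered:
        if all(identity(seq, rep) <= cutoff for rep in representatives.values()):
            representatives[name] = seq
    return representatives


if __name__ == "__main__":
    base = "ACGT" * 25                       # toy 100-nt sequence
    near_copy = base[:50] + "T" + base[51:]  # one substitution, 99% identical
    distinct = "TTTTGGGGCCCCAAAA" * 5        # unrelated toy sequence
    kept = deduplicate({"a": base, "b": near_copy, "c": distinct})
    print(sorted(kept))
```

The pairwise comparison here is quadratic in the number of sequences, so it only suits toy inputs; a collection the size of RVRD would need a purpose-built clustering program.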

Files

Original bundle
Name: Lei_S_T_2018.pdf
Size: 1.33 MB
Format: Adobe Portable Document Format
