Assessing the Role of Clusters Derived from Large Sequence Similarity Networks for Gene Function Predictions

Vora, Parth Harish

Files

Vora_PH_T_2020.pdf (14.07 MB)

Date

2020-05-29

Authors

Vora, Parth Harish

Publisher

Virginia Tech

Abstract

Large-scale genomic sequencing efforts have resulted in a massive inflow of raw sequence data. This raw data, when appropriately processed and analyzed, can provide insight to a trained biologist and aid in hypothesis-driven research. Given the time and resources that biological experiments require, computational predictions of gene function can help reduce a large list of candidate genes to a few promising targets. Various computational solutions have been proposed and developed for gene function prediction. These solutions draw on many forms of data, such as DNA/RNA/protein sequences, protein structures, interaction networks, literature mining, and combinations of these sources. However, these methods do not always produce precise results, because the underlying data sets used for training or modeling are quite sparse. We developed and used a massive sequence similarity network built over 108 million known protein sequences to aid in protein function prediction. Predictions are made by aligning query sequences to the representative sequences of clusters derived from the massive sequence similarity network. Each derived cluster aggregates information (particularly Gene Ontology annotations) from its members, which we then consolidate through a novel weighted path method. We evaluate our method on four holdout datasets using CAFA evaluation metrics. Our results suggest that clustering significantly reduces time and memory requirements, with only a marginal impact on predictive power. At lower sequence similarity thresholds, our method outperforms other gold-standard methods.
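The abstract mentions CAFA evaluation metrics without naming them. CAFA evaluations commonly report the protein-centric Fmax, the best harmonic mean of average precision and average recall over all score thresholds. The following is a minimal sketch of that metric, not the thesis's evaluation code; the function name `fmax`, its parameters, and the dictionary layout are assumptions made for illustration, and the term sets are assumed to be already propagated to their ontology ancestors.

```python
from typing import Dict, Set


def fmax(predictions: Dict[str, Dict[str, float]],
         truth: Dict[str, Set[str]],
         steps: int = 100) -> float:
    """Protein-centric Fmax (illustrative sketch, hypothetical interface).

    predictions: protein id -> {GO term: score in [0, 1]}
    truth:       protein id -> set of true GO terms
    """
    # Only proteins with at least one true term form the benchmark.
    benchmark = {p: terms for p, terms in truth.items() if terms}
    if not benchmark:
        return 0.0

    best = 0.0
    for i in range(1, steps + 1):
        t = i / steps
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in benchmark.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            # Precision averages only over proteins with a prediction at t.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall averages over every benchmark protein.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(benchmark)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy example with made-up identifiers and scores.
example_predictions = {
    "P1": {"GO:1": 0.9, "GO:2": 0.4},
    "P2": {"GO:3": 0.8},
}
example_truth = {"P1": {"GO:1"}, "P2": {"GO:3", "GO:4"}}
example_score = fmax(example_predictions, example_truth)
```

In the toy example the best threshold keeps `GO:1` and `GO:3` and drops `GO:2`, giving precision 1 and recall 0.75.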

Keywords

bioinformatics, computational biology, protein sequences

Persistent link

http://hdl.handle.net/10919/106704

Collections

Masters Theses
