Assessing the Role of Clusters Derived from Large Sequence Similarity Networks for Gene Function Predictions

Vora, Parth Harish

dc.contributor.author	Vora, Parth Harish	en
dc.contributor.committeechair	Kale, Shiv D.	en
dc.contributor.committeechair	Murali, T. M.	en
dc.contributor.committeemember	Heath, Lenwood S.	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2021-11-21T07:00:15Z	en
dc.date.available	2021-11-21T07:00:15Z	en
dc.date.issued	2020-05-29	en
dc.description.abstract	Large-scale genomic sequencing efforts have resulted in a massive inflow of raw sequence data. This raw data, when appropriately processed and analyzed, can provide insight to a trained biologist and aid in hypothesis-driven research. Given the time and resource requirements of biological experiments, computational predictions of gene functions can help reduce a large list of candidate genes to a few promising targets. Various computational solutions have been proposed and developed for gene function prediction. These solutions utilize various forms of data, such as DNA/RNA/protein sequences, protein structures, interaction networks, literature mining, and combinations of these data sources. However, these methods do not always produce precise results, as the underlying data sets used for training or modeling are quite sparse. We developed and used a massive sequence similarity network built over 108 million known protein sequences to aid in protein function prediction. Predictions are made through the alignment of query sequences to representative sequences for a given cluster derived from the massive sequence similarity network. Derived clusters aggregate information (particularly that from the Gene Ontology) from their respective members, which we then consolidate through a novel weighted path method. We evaluate our method on four holdout datasets using CAFA evaluation metrics. Our results suggest that clustering significantly reduces the time and memory requirements, with a marginal impact on predictive power. At lower sequence similarity thresholds, our method outperforms other gold-standard methods.	en
dc.description.abstractgeneral	We often think of protein as a nutritional requirement. However, proteins are far more than just food; they play countless and unappreciated roles in facilitating life. These roles range from transporting nutrients in the body, synthesizing hormones, and functioning as enzymes that expedite chemical reactions, to serving as the scaffold for cells and tissues and protecting the body against foreign pathogens. On a molecular level, each protein is made up of chains of 20 different amino acids, just like a chain of beads, that are then folded to create a 3-dimensional structure. Variations in the ordering of amino acids result in different types of proteins. There are millions of genes across known life, and they perform different functions when translated into proteins. Nature has given us many proteins with interesting properties, and the low cost of sequencing their precursors (DNA) has resulted in large amounts of sequence data that is not yet associated with a function. Biological experiments to determine the function of a protein can be time-consuming and expensive. We built a massive network encompassing 108 million protein sequences based on sequence similarity. This ensures that we make use of as much data as possible to make better predictions. Specifically, our work focuses on using this information about similar proteins to aid in predicting the functions of a protein given its sequence. It is based on the idea of guilt by association: if two proteins are similar in sequence, they perform similar functions. We show that, using computationally efficient methods and large datasets, one can achieve fast and highly precise predictions.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:26143	en
dc.identifier.uri	http://hdl.handle.net/10919/106704	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	bioinformatics	en
dc.subject	computational biology	en
dc.subject	protein sequences	en
dc.title	Assessing the Role of Clusters Derived from Large Sequence Similarity Networks for Gene Function Predictions	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en

Files

Original bundle

Name: Vora_PH_T_2020.pdf
Size: 14.07 MB
Format: Adobe Portable Document Format

Collections

Masters Theses