Deep Learning Based Proteomic Language Modelling for in-silico Protein Generation

Kesavan Nair, Nitin

dc.contributor.author	Kesavan Nair, Nitin	en
dc.contributor.committeechair	Xuan, Jianhua	en
dc.contributor.committeechair	Aylward, Frank O.	en
dc.contributor.committeemember	Abbott, A. Lynn	en
dc.contributor.department	Electrical and Computer Engineering	en
dc.date.accessioned	2022-03-24T06:00:07Z	en
dc.date.available	2022-03-24T06:00:07Z	en
dc.date.issued	2020-09-29	en
dc.description.abstract	A protein is a biopolymer of amino acids that encodes a particular function. Given that there are 20 amino acids possible at each site, even a short protein of 100 amino acids has $20^{100}$ possible variants, making it unrealistic to evaluate every sequence in sequence space. This search space can be reduced by considering that billions of years of evolution, exerting constant selective pressure, have left us with only a small subset of protein sequences that carry out particular cellular functions. The portion of amino acid space occupied by actual proteins found in nature is therefore much smaller than that which is possible (Kauffman, 1993). By examining related proteins that share a conserved function and common evolutionary history (hereafter referred to as protein families), it is possible to identify the motifs they share. Examining these motifs allows us to characterize protein families in greater depth and even to generate new "in silico" proteins that are not found in nature but exhibit properties of a particular protein family. Using novel deep learning approaches and leveraging the large volume of genomic data now available from high-throughput DNA sequencing, it is possible to examine protein families at a scale and resolution that was not previously possible. In this work, we use this abundance of data to learn high-dimensional representations of amino acid sequences and show that it is possible to generate novel sequences from a particular protein family. Such a deep sequential model-based approach has great value for bioinformatics and biotechnological applications due to its rapid sampling abilities.	en
dc.description.abstractgeneral	Proteins are among the most important functional elements in biology. They are composed of amino acids that link together and fold into different shapes, and a protein's shape can determine its function. Proteins may act independently or may form "complexes" to carry out a particular function, so understanding them is of the utmost importance. Because there are 20 amino acids, even a protein sequence fragment of length 5 can have more than 3 million different combinations. Given that proteins can be around 1000 amino acids long, examining all the possibilities is practically impossible. In this work, by leveraging the "deep learning" paradigm and the vast amount of data available, we model these proteins and generate new proteins belonging to a specific "protein family." This approach has great value for bioinformatics and biotechnological applications due to its rapid sampling abilities.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:27399	en
dc.identifier.uri	http://hdl.handle.net/10919/109435	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Protein	en
dc.subject	Deep learning (Machine learning)	en
dc.subject	Recurrent Neural Network	en
dc.subject	Auto-Regressive Models	en
dc.title	Deep Learning Based Proteomic Language Modelling for in-silico Protein Generation	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Engineering	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en
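
The sequence-space figures quoted in the two abstracts above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the thesis: the function name `sequence_space_size` and the constant `NUM_AMINO_ACIDS` are hypothetical, and the only assumption is the one stated in the abstract, that each position can hold any of 20 amino acids.

```python
# Hypothetical helper (not from the thesis): counts the distinct sequences of a
# given length, assuming 20 possible amino acids at every position.
NUM_AMINO_ACIDS = 20


def sequence_space_size(length: int) -> int:
    """Return the number of distinct amino acid sequences of `length` residues."""
    return NUM_AMINO_ACIDS ** length


fragment = sequence_space_size(5)         # the length-5 fragment from the general abstract
short_protein = sequence_space_size(100)  # the 100-residue protein from the abstract

# Python integers are arbitrary precision, so 20**100 is computed exactly.
print(f"length 5:   {fragment:,} sequences")
print(f"length 100: a {len(str(short_protein))}-digit number of sequences")
```

For a fragment of length 5 this gives 20^5 = 3,200,000, consistent with the "more than 3 million" figure; for length 100 the count is on the order of 10^130, which is why exhaustive evaluation is unrealistic.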

Files

Original bundle

Name: Kesavan_Nair_N_T_2020.pdf
Size: 41.79 MB
Format: Adobe Portable Document Format

Collections

Masters Theses