Deep Learning Based Proteomic Language Modelling for in-silico Protein Generation

dc.contributor.author: Kesavan Nair, Nitin (en)
dc.contributor.committeechair: Xuan, Jianhua (en)
dc.contributor.committeechair: Aylward, Frank O. (en)
dc.contributor.committeemember: Abbott, A. Lynn (en)
dc.contributor.department: Electrical and Computer Engineering (en)
dc.description.abstract: A protein is a biopolymer of amino acids that encodes a particular function. Given that there are 20 amino acids possible at each site, even a short protein of 100 amino acids has $20^{100}$ possible variants, making it unrealistic to evaluate every sequence in sequence space. This search space can be reduced by noting that billions of years of evolution exerting a constant pressure have left us with only a small subset of protein sequences that carry out particular cellular functions. The portion of amino acid sequence space occupied by proteins found in nature is therefore much smaller than what is possible \cite{kauffman1993origins}. By examining related proteins that share a conserved function and a common evolutionary history (hereafter referred to as protein families), it is possible to identify the motifs they share. Examining these motifs allows us to characterize protein families in greater depth and even to generate new "in silico" proteins that are not found in nature but exhibit properties of a particular protein family. Using novel deep learning approaches and the large volume of genomic data now available from high-throughput DNA sequencing, protein families can be examined at a scale and resolution that was not previously possible. In this work, we use this abundance of data to learn high-dimensional representations of amino acid sequences and show that it is possible to generate novel sequences from a particular protein family. Such a deep sequential model-based approach has great value for bioinformatics and biotechnological applications because it allows rapid sampling. (en)
dc.description.abstractgeneral: Proteins are among the most important functional elements in biology. They are composed of amino acids that link together and fold into different shapes, and a protein's shape can encode a particular function. Proteins may act independently or may form "complexes" to carry out a function, so understanding them is of utmost importance. Because there are 20 amino acids, even a protein sequence fragment of length 5 can have more than 3 million different combinations. Given that proteins can be around 1000 amino acids long, examining all the possibilities is next to impossible. In this work, by leveraging the "deep learning" paradigm and the vast amount of data available, we try to model these proteins and generate new proteins belonging to a specific "protein family." This approach has great value for bioinformatics and biotechnological applications because it allows rapid sampling. (en)
dc.description.degree: Master of Science (en)
dc.publisher: Virginia Tech (en)
dc.rights: In Copyright (en)
dc.subject: Deep learning (Machine learning) (en)
dc.subject: Recurrent Neural Network (en)
dc.subject: Auto-Regressive Models (en)
dc.title: Deep Learning Based Proteomic Language Modelling for in-silico Protein Generation (en)
dc.type: Thesis (en)