Deep Learning Based Proteomic Language Modelling for in-silico Protein Generation

Virginia Tech

A protein is a biopolymer of amino acids that encodes a particular function. Given that there are 20 possible amino acids at each site, even a short protein of 100 amino acids has 20^100 possible variants, making it unrealistic to evaluate every sequence exhaustively. This search space can be reduced by recognizing that billions of years of evolution, exerting constant pressure, have left us with only a small subset of protein sequences that carry out particular cellular functions. The portion of amino acid sequence space occupied by proteins actually found in nature is therefore much smaller than the space of possible sequences (Kauffman, 1993). By examining related proteins that share a conserved function and a common evolutionary history (hereafter referred to as protein families), it is possible to identify the motifs they share. Examining these motifs allows us to characterize protein families in greater depth and even to generate new "in silico" proteins that are not found in nature but exhibit the properties of a particular protein family. Using novel deep learning approaches and the large volume of genomic data now available from high-throughput DNA sequencing, protein families can be examined at a scale and resolution that was not previously possible. In this work, we use this abundance of data to learn high-dimensional representations of amino acid sequences and show that it is possible to generate novel sequences from a particular protein family. Because it can be sampled rapidly, such a deep sequential model-based approach has great value for bioinformatics and biotechnological applications.
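To give a sense of the scale behind the 20-amino-acid, 100-site figure, the count can be worked out directly. The following is a minimal sketch for illustration only; the function names `sequence_space_size` and `order_of_magnitude` are hypothetical and are not part of this work.

```python
# Illustrative arithmetic only: size of the sequence space for a protein
# of a given length, assuming each site can independently hold any of
# the 20 standard amino acids.

NUM_AMINO_ACIDS = 20  # standard amino acid alphabet size


def sequence_space_size(length, alphabet_size=NUM_AMINO_ACIDS):
    """Return the number of distinct sequences of the given length."""
    # Python integers have arbitrary precision, so this is exact.
    return alphabet_size ** length


def order_of_magnitude(n):
    """Return the base-10 exponent of a positive integer n."""
    return len(str(n)) - 1


size = sequence_space_size(100)
print(f"20^100 is on the order of 10^{order_of_magnitude(size)}")
```

Since 20^100 = 2^100 x 10^100, the count is roughly 1.27 x 10^130, which is why exhaustive evaluation is impractical and a generative model that samples from a learned family-specific distribution is attractive.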

Protein, Deep learning (Machine learning), Recurrent Neural Network, Auto-Regressive Models