Large-scale protein function prediction using heterogeneous ensembles
Files
TR Number
Date
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Heterogeneous ensembles are an effective approach in scenarios where the ideal data type and/or individual predictor are unclear for a given problem. These ensembles have shown promise for protein function prediction (PFP), but their ability to improve PFP at a large scale is unclear. The overall goal of this study is to critically assess this ability of a variety of heterogeneous ensemble methods across a multitude of functional terms, proteins and organisms. Our results show that these methods, especially Stacking using Logistic Regression, indeed produce more accurate predictions for a variety of Gene Ontology terms differing in size and specificity. To enable the application of these methods to other related problems, we have publicly shared the HPC-enabled code underlying this work as LargeGOPred (https://github.com/GauravPandeyLab/LargeGOPred).