Missing genes in the annotation of prokaryotic genomes

dc.contributor.author: Warren, Andrew S.
dc.contributor.author: Archuleta, Jeremy
dc.contributor.author: Feng, Wu-chun
dc.contributor.author: Setubal, João C.
dc.contributor.department: Computer Science
dc.contributor.department: Fralin Life Sciences Institute
dc.date.accessioned: 2012-08-24T11:25:03Z
dc.date.available: 2012-08-24T11:25:03Z
dc.date.issued: 2010-03-15
dc.date.updated: 2012-08-24T11:25:03Z
dc.description.abstract: Background: Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). Therefore the question arises as to whether current genome annotations have systematically missing, small genes. Results: We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully-sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs. Conclusions: Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.
dc.description.sponsorship: IBM Faculty Award: VTF- 873901
dc.description.sponsorship: Virginia Bioinformatics Institute
dc.description.version: Published version
dc.format.mimetype: application/pdf
dc.identifier.citation: BMC Bioinformatics. 2010 Mar 15;11(1):131
dc.identifier.doi: https://doi.org/10.1186/1471-2105-11-131
dc.identifier.uri: http://hdl.handle.net/10919/18847
dc.identifier.volume: 11
dc.language.iso: en
dc.publisher: BioMed Central
dc.rights: Creative Commons Attribution 4.0 International
dc.rights.holder: Andrew S Warren et al.; licensee BioMed Central Ltd.
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.title: Missing genes in the annotation of prokaryotic genomes
dc.title.serial: BMC Bioinformatics
dc.type: Article - Refereed
dc.type.dcmitype: Text
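
As a rough illustration of the first step mentioned in the abstract (collecting all ORFs of at least 33 aa), the Python sketch below scans the six reading frames of a DNA string. The start-codon set, the start-to-stop ORF definition, and the function names are assumptions made for this example; the record does not say how ORFs were delimited, and this is not the authors' high-performance pipeline.

```python
# Sketch only: ORF definition and helper names are assumptions for
# illustration, not the method used in the article.
STARTS = {"ATG", "GTG", "TTG"}   # assumed prokaryotic start codons
STOPS = {"TAA", "TAG", "TGA"}
MIN_AA = 33                      # length threshold quoted in the abstract

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_aa=MIN_AA):
    """Yield (start, end) for each ORF in the three frames of one strand.

    An ORF runs from the first start codon after the previous stop to the
    next in-frame stop codon; `end` is the index of the stop codon, so
    seq[start:end] is the coding part without the stop.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_aa:
                    yield start, i
                start = None


def find_orfs(seq, min_aa=MIN_AA):
    """Return (strand, start, end) for ORFs of at least `min_aa` codons.

    Coordinates on the "-" strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    found = [("+", s, e) for s, e in orfs_in_strand(seq, min_aa)]
    found += [("-", s, e) for s, e in orfs_in_strand(reverse_complement(seq), min_aa)]
    return found
```

In a comparison like the one the abstract describes, the ORFs returned here would then be translated and compared across genomes; that similarity search and the taxonomic-diversity filter are outside the scope of this sketch.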

Files

Original bundle
Now showing 1 - 5 of 7
Name:
1471-2105-11-131.pdf
Size:
452.41 KB
Format:
Adobe Portable Document Format
Name:
1471-2105-11-131-S1.PDF
Size:
40.58 KB
Format:
Adobe Portable Document Format
Name:
1471-2105-11-131-S2.PDF
Size:
135.46 KB
Format:
Adobe Portable Document Format
Name:
1471-2105-11-131-S3.PDF
Size:
34.82 KB
Format:
Adobe Portable Document Format
Name:
1471-2105-11-131-S4.TXT
Size:
134.37 KB
Format:
Plain Text
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission