Methods for Analysis of Prokaryotic Genome Architecture
Research in comparative microbial genomics has largely been organized around the concept of reference genomes. Reference genomes provide a useful comparative touchstone for closely related organisms. However, they do not necessarily represent the biological diversity in a group of genomes. Currently there are more than 96,000 bacterial genomes sequenced and this number is rapidly increasing. Some closely related groups have large numbers of genomes sequenced creating interesting comparative challenges: E. coli more than 5,400 isolates, S. aureus almost 9,000. As this sampling through sequencing becomes both deeper and broader, reference genome based methods become less effective at characterizing groups of organisms.
Functional motifs can help explain the organizing principles behind cellular systems in bacteria which have yet to be well understood. Currently there are relatively few bioinformatic tools for analyzing potential patterns at the level of genome organization that do not depend directly on sequence similarity. We present a framework for conducting genomic data mining to look for patterns that currently require human expert designation. We establish new computational methods for identifying patterns in prokaryotic genome construction through a mapping of genomic features, using semantic similarity, independent of a particular corpus to better approximate functional similarity.
We also present an algorithm for creating whole genome multiple sequence comparisons and a model for representing the similarities and di erences among sequences as a graph of syntenic gene families. This e ort touches on several di erent research fronts: graph representation of genomes and their alignments, synteny block analysis, whole genome sequence alignment, pan-genome analysis, multiple sequence alignment, and genome rearrangement analysis. Though our approach was originally developed from a pan-genome perspective for prokaryotes, the methods involved have the potential to speed up more expensive computation such as phylogenetic tree construction and SNP analysis. Novel elements include the contextualization of synteny analysis both between and within multi-contig genomes and an analytical framework for detecting genome level evolutionary events such as insertions, inversions, translocations, and fusions.