Ancestral Genome Reconstruction in Bacteria
The rapid accumulation of sequenced genomes has provided a golden opportunity for ancestral state reconstruction studies, especially for whole-genome reconstruction. However, most ancestral genome reconstruction methods developed so far focus only on gene or replicon sequences rather than whole genomes. They rely largely on either detailed modeling of evolutionary events or edit distance computation, both of which can be computationally prohibitive for large data sets. Hence, most of these methods can be applied only to a small number of features and species. In this dissertation, we describe the design, implementation, and evaluation of an ancestral genome reconstruction system (REGEN) for bacteria. It is the first bacterial genome reconstruction tool that focuses on ancestral state reconstruction at the genome scale instead of the gene scale. It reconstructs not only the ancestral gene content and contiguous gene runs, using either a maximum parsimony or a maximum likelihood criterion, but also the replicon structure of each ancestor. Based on the reconstructed genomes, it can infer all major events at both the gene scale, such as insertion, deletion, and translocation, and the replicon scale, such as replicon gain, loss, and merge. REGEN finishes by producing a visual representation of the entire evolutionary history of all genomes in the study. Because REGEN has a model-free reconstruction method at its core, its computational requirements are low enough for the tool to be applied to large data sets with dozens of genomes and thousands of features. To make the reconstruction as accurate as possible, we also develop a homologous gene family prediction tool for preprocessing. Furthermore, we build an in-house Prokaryote Genome Evolution simulator (PEGsim) for evaluation purposes.
The homologous gene family prediction refinement module refines the homologous gene family predictions generated by third-party de novo prediction programs by combining phylogeny and local gene synteny. We show that such refinement can be accomplished for up to 80% of the ambiguous homologous gene family predictions (mixed families). The genome evolution simulator, PEGsim, is the first high-level, random-event-based bacterial genome evolution simulator with models for all common evolutionary events at the gene, replicon, and genome scales. The concepts of conserved gene runs and horizontal gene transfer (HGT) are also built in. We validate PEGsim itself and use the simulated data it produces to evaluate the final component, the reconstruction tool. REGEN, REconstruction of GENomes, is an ancestral genome reconstruction tool based on the concept of neighboring gene pairs (NGPs). Although it does not reconstruct actual nucleotide sequences, it is capable of reconstructing the gene content, contiguous gene runs, and replicon structure of each ancestor using either a maximum parsimony or a maximum likelihood criterion.
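To make the idea of neighboring gene pairs and the maximum parsimony criterion concrete, the following is a minimal illustrative sketch, not REGEN's actual implementation. It assumes that an NGP is an unordered pair of adjacent genes on a circular replicon, that the species tree is binary, and that presence or absence of each NGP is reconstructed independently with Fitch's small-parsimony algorithm, with ties at the root resolved as absent. All function names, the tree encoding, and the toy genomes are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]  # leaf name, or (left, right) subtree
NGP = FrozenSet[str]                       # unordered pair of adjacent genes


def neighboring_gene_pairs(replicon: List[str], circular: bool = True) -> Set[NGP]:
    """Collect unordered adjacent gene pairs along one replicon."""
    n = len(replicon)
    # A circular replicon also joins its last gene back to its first.
    steps = n if (circular and n > 2) else n - 1
    return {frozenset((replicon[i], replicon[(i + 1) % n])) for i in range(steps)}


def _bottom_up(tree, leaf_state, sets):
    """Fitch pass 1: candidate state sets ({0}, {1} or {0, 1}) for every node."""
    if isinstance(tree, str):
        states = {leaf_state[tree]}
    else:
        left = _bottom_up(tree[0], leaf_state, sets)
        right = _bottom_up(tree[1], leaf_state, sets)
        states = (left & right) or (left | right)
    sets[tree] = states
    return states


def _top_down(tree, parent_state, sets, chosen):
    """Fitch pass 2: pick one state per internal node, following the parent."""
    if isinstance(tree, str):
        return
    states = sets[tree]
    # Assumed tie rule: an ambiguous root is resolved as absent (0).
    state = parent_state if parent_state in states else min(states)
    chosen[tree] = state
    for child in tree:
        _top_down(child, state, sets, chosen)


def reconstruct_ancestral_ngps(
    tree: Tree, genomes: Dict[str, List[List[str]]]
) -> Dict[Tree, Set[NGP]]:
    """Return, for each internal node, the NGPs inferred to be present."""
    leaf_ngps = {
        name: set().union(*(neighboring_gene_pairs(r) for r in replicons))
        for name, replicons in genomes.items()
    }
    ancestors: Dict[Tree, Set[NGP]] = {}
    for pair in set().union(*leaf_ngps.values()):
        leaf_state = {name: int(pair in ngps) for name, ngps in leaf_ngps.items()}
        sets, chosen = {}, {}
        _bottom_up(tree, leaf_state, sets)
        _top_down(tree, None, sets, chosen)
        for node, state in chosen.items():
            if state:
                ancestors.setdefault(node, set()).add(pair)
    return ancestors


# Toy data: three genomes, each with a single circular replicon.
tree = (("A", "B"), "C")
genomes = {
    "A": [["g1", "g2", "g3", "g4"]],
    "B": [["g1", "g2", "g3", "g5"]],
    "C": [["g1", "g2", "g4", "g3"]],
}
ancestors = reconstruct_ancestral_ngps(tree, genomes)
for node, pairs in ancestors.items():
    print(node, sorted(tuple(sorted(p)) for p in pairs))
```

In this toy example, the adjacency of g2 and g3 is shared by A and B but missing in C, so the sketch places it in the ancestor of A and B while the assumed tie rule leaves it out of the root. Chaining the reconstructed pairs that share a gene would then give candidate contiguous gene runs for each ancestor.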