Host-Microbe Relations: A Phylogenomics-Driven Bioinformatic Approach to the Characterization of Microbial DNA from Heterogeneous Sequence Data




Virginia Tech


Plants and animals are characterized by intimate, enduring, often indispensable, and always complex associations with microbes. Therefore, it should come as no surprise that when the genome of a eukaryote is sequenced, a medley of bacterial sequences is produced as well. These sequences can be highly informative about the interactions between the eukaryote and its bacterial cohorts; unfortunately, they often comprise a vanishingly small constituent within a heterogeneous mixture of microbial and host sequences. Genomic analyses typically avoid the bacterial sequences in order to obtain a genome sequence for the host. Metagenomic analyses typically avoid the host sequences in order to analyze the community composition and functional diversity of the bacterial component. This dissertation describes the development of a novel approach at the intersection of genomics and metagenomics, aimed at the extraction and characterization of bacterial sequences from heterogeneous sequence data using phylogenomic and bioinformatic tools.

To achieve this objective, three interoperable workflows were constructed as modular computational pipelines, with built-in checkpoints for periodic interpretation and refinement. The MetaMiner workflow uses 16S small subunit rDNA analysis to enable the systematic discovery and classification of bacteria associated with a host genome sequencing project. Using this information, the ReadMiner workflow comprehensively extracts, assembles, and characterizes sequences that belong to a target microbe. Finally, AssemblySifter examines the genes and scaffolds of the eukaryotic genome for sequences associated with the target microbe. The combined information from these three workflows is used to systematically characterize a bacterial target of interest, including robust estimation of its phylogeny, assessment of its signature profile, and determination of its relationship to the associated eukaryote.

This dissertation presents the development of the described methodology and its application to three eukaryotic genome projects. In the first study, the genomic sequences of a single, known endosymbiont were extracted from the genome sequencing data of its host. In the second study, a highly divergent endosymbiont was characterized from the assembled genome of its host. In the third study, genome sequences from a novel bacterium were extracted from both the raw sequencing data and the assembled genome of a eukaryote that contained significant amounts of sequence from multiple competing bacteria. Taken together, these results demonstrate the usefulness of the described approach in singularly disparate situations, and strongly argue for a sophisticated, multifaceted, supervised approach to the characterization of host-associated microbes and their interactions.



phylogenomics, genome-mining, host-microbe interactions, genomics, bioinformatics, symbiosis, bacteria, lateral gene transfer