Graph-based genomic signatures

dc.contributor.author: Pati, Amrita [en]
dc.contributor.committeechair: Heath, Lenwood S. [en]
dc.contributor.committeemember: Helm, Richard F. [en]
dc.contributor.committeemember: Ramakrishnan, Naren [en]
dc.contributor.committeemember: Shende, Anil M. [en]
dc.contributor.committeemember: Setubal, João C. [en]
dc.contributor.department: Computer Science [en]
dc.date.accessioned: 2014-03-14T20:11:12Z [en]
dc.date.adate: 2008-05-14 [en]
dc.date.available: 2014-03-14T20:11:12Z [en]
dc.date.issued: 2008-04-14 [en]
dc.date.rdate: 2008-05-14 [en]
dc.date.sdate: 2008-04-28 [en]
dc.description.abstract: Genomes have both deterministic and random aspects, with the underlying DNA sequences exhibiting features at numerous scales, from codons to regions of conserved or divergent gene order. Genomic signatures work by capturing one or more such features efficiently into a compact mathematical structure. This work examines the unique manner in which oligonucleotides fit together to compose a genome, within a graph-theoretic setting. A de Bruijn chain (DBC) is a marriage of a de Bruijn graph and a finite Markov chain. By representing a DNA sequence as a walk over a DBC and retaining specific information at nodes and edges, we are able to obtain the de Bruijn chain genomic signature (DBCGS), based on both graph structure and the stationary distribution of the DBC. We demonstrate that the DBCGS is information-rich, efficient, sufficiently representative of the sequence from which it is derived, and superior to existing genomic signatures such as the dinucleotide odds ratio and word-frequency-based signatures. We develop a mathematical framework to elucidate the power of the DBCGS to distinguish between sequences hypothesized to be generated by DBCs with distinct parameters. We study the effect of the order of the DBCGS on accuracy while presenting relationships with genome size and genome variety. We illustrate its practical value in distinguishing genomic sequences and predicting the source of short DNA sequences of unknown origin, while highlighting its superior performance compared to existing genomic signatures, including the dinucleotide odds ratio. Additionally, we describe details of the CMGS database, a centralized repository for raw and value-added data particular to C. elegans. [en]
dc.description.degree: Ph. D. [en]
dc.identifier.other: etd-04282008-150624 [en]
dc.identifier.sourceurl: http://scholar.lib.vt.edu/theses/available/etd-04282008-150624/ [en]
dc.identifier.uri: http://hdl.handle.net/10919/27423 [en]
dc.publisher: Virginia Tech [en]
dc.relation.haspart: Pati_Dissertation.pdf [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: Markov chains [en]
dc.subject: de Bruijn graphs [en]
dc.subject: Genomic signatures [en]
dc.subject: DNA words [en]
dc.title: Graph-based genomic signatures [en]
dc.type: Dissertation [en]
thesis.degree.discipline: Computer Science [en]
thesis.degree.grantor: Virginia Polytechnic Institute and State University [en]
thesis.degree.level: doctoral [en]
thesis.degree.name: Ph. D. [en]
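
Illustrative sketch (not part of the repository record): the abstract describes a de Bruijn chain as a de Bruijn graph combined with a finite Markov chain, with a DNA sequence read as a walk over it. The Python below is a minimal sketch of that general idea only, and it is not the dissertation's DBCGS construction, whose details are in Pati_Dissertation.pdf. The function names, the choice to treat the sequence as circular (so that every k-mer has a successor), and the damped power iteration are all assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_chain(seq, k):
    """Estimate transition probabilities between overlapping k-mers.

    Assumption: the sequence is treated as circular, so the walk over
    k-mers closes on itself and no node is a dead end.
    """
    extended = seq + seq[:k]  # wrap around to close the walk
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq)):
        src = extended[i:i + k]          # current k-mer (node)
        dst = extended[i + 1:i + k + 1]  # next k-mer, overlapping by k-1
        counts[src][dst] += 1
    chain = {}
    for src, outs in counts.items():
        total = sum(outs.values())
        chain[src] = {dst: c / total for dst, c in outs.items()}
    return chain


def stationary_distribution(chain, iterations=2000):
    """Approximate the stationary distribution by damped power iteration.

    Each step averages the current distribution with its one-step image,
    which keeps the same fixed point but avoids oscillation when the
    chain is periodic.
    """
    nodes = sorted(chain)
    pi = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        stepped = dict.fromkeys(nodes, 0.0)
        for src, outs in chain.items():
            for dst, p in outs.items():
                stepped[dst] += pi[src] * p
        pi = {node: 0.5 * pi[node] + 0.5 * stepped[node] for node in nodes}
    return pi
```

Calling `de_bruijn_chain("ACGTACGGTCA", 2)` and passing the result to `stationary_distribution` yields one probability per observed 2-mer. Under the circular assumption used here, the stationary distribution coincides with the k-mer frequencies of the sequence, which is why a signature that also uses graph structure, as the abstract describes, carries more than word frequencies alone.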

Files

Original bundle
Name: Pati_Dissertation.pdf
Size: 8.41 MB
Format: Adobe Portable Document Format