Show simple item record

dc.contributor.authorRajasimha, Harsha Karuren_US
dc.date.accessioned2014-03-14T20:50:28Z
dc.date.available2014-03-14T20:50:28Z
dc.date.issued2004-12-15en_US
dc.identifier.otheretd-12202004-135546en_US
dc.identifier.urihttp://hdl.handle.net/10919/36325
dc.description.abstractA biological pathway database is a database that describes biochemical pathways, reactions, enzymes that catalyze the reactions, and the substrates that participate in these reactions. A pathway genome database (PGDB) integrates pathway information with information about the complete genome of various sequenced organisms. Two of the popular PGDBs available today are the Kyoto Encyclopedia of Genes and Genomes (KEGG) and MetaCyc. The proliferation of biological databases in general raises several questions for the life scientist. Which of these databases is most accurate, most current, or most comprehensive? Do they have a standard format? Do they complement each other? Overall, which database should be used for what purpose? If more than one database is deemed relevant, it is desirable to have a unified database containing information from all the shortlisted databases. There is no standard methodology yet for integrating biological pathway databases and, to the best of our knowledge, no commercial software that can perform such integration tasks. While XML-based pathway data exchange standards such as BioPAX and SBML are emerging, they do not address basic problems in the unification of pathway databases, such as inconsistent nomenclature and substrate matching between databases. Here, we present the PathMeld methodology to unify the KEGG and MetaCyc databases starting from their flat files. Individual PGDBs are transformed into a unified schema that we design. With individual PGDBs in the common unified schema, the key to the PathMeld methodology is to find the entity correspondences between the KEGG and MetaCyc substrates. We present a heuristic-driven approach for one-to-one mapping of the substrates between KEGG and MetaCyc. Using the exact name and chemical formula match criteria, 82.6% of the substrates in MetaCyc were matched accurately to corresponding substrates in KEGG.
The substrate names in the MetaCyc database contain HTML tags and non-characters such as , , , , &, and $. The MetaCyc chemical formulae are stored in Lisp format in the database, while KEGG stores them as continuous strings. Hence, we transform MetaCyc chemical formulae into KEGG format to make them directly comparable. Applying pre-processing to transform MetaCyc substrate names and formulae improved substrate matching by 2%. To investigate how many of the remaining 17.4% of substrates are indeed absent from KEGG, we employ a standard UNIX-based approximate string matching tool called agrep. The resulting matches are curated into four mutually exclusive groups: 3.83% are correct matches, 3.17% are close matches, and 7.45% are incorrect matches. 3.68% of MetaCyc substrate names are not matched at all. This shows that 11.13% of MetaCyc substrate names are absent in KEGG. We note some of the implementation issues we solved. First, parsing only one flat file to populate one database table is not sufficient. Second, intermediate database tables are needed. Third, transformation of substrate names and chemical formulae from one of the component databases is required for comparison. Fourth, a biochemist's intervention is needed in evaluating the approximate substrate matches from agrep. In conclusion, the PathMeld methodology successfully unifies KEGG and MetaCyc flat file databases into a unified PostgreSQL database. Matching substrates between databases is the key issue in the unification process. About 83% of the substrate correspondences can be computationally achieved, while the remaining 17% of substrates require approximate matching and manual curation by a biochemist. We presented several different techniques for substrate matching and showed that about 10% of the MetaCyc substrates do not match and hence are absent from KEGG.en_US
dc.publisherVirginia Techen_US
dc.relation.haspartRajasimha_MSCS2004.pdfen_US
dc.rightsI hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to Virginia Tech or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report.en_US
dc.subjectKEGGen_US
dc.subjectEcoCycen_US
dc.subjectintegrationen_US
dc.subjectunificationen_US
dc.subjectPGDBen_US
dc.subjectPathMelden_US
dc.subjectMetaCycen_US
dc.subjectmetabolic pathway databasesen_US
dc.titlePathMeld: A Methodology for The Unification of Metabolic Pathway Databasesen_US
dc.typeThesisen_US
dc.contributor.departmentComputer Scienceen_US
dc.description.degreeMaster of Scienceen_US
thesis.degree.nameMaster of Scienceen_US
thesis.degree.levelmastersen_US
thesis.degree.grantorVirginia Polytechnic Institute and State Universityen_US
thesis.degree.disciplineComputer Scienceen_US
dc.contributor.committeechairHeath, Lenwood S.en_US
dc.contributor.committeememberRamakrishnan, Narenen_US
dc.contributor.committeememberGrene, Ruthen_US
dc.identifier.sourceurlhttp://scholar.lib.vt.edu/theses/available/etd-12202004-135546/en_US
dc.date.sdate2004-12-20en_US
dc.date.rdate2004-12-29
dc.date.adate2004-12-29en_US
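
The pre-processing and exact-match step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes that a MetaCyc formula looks like `((C 6) (H 12) (O 6))`, that a KEGG formula looks like `C6H12O6`, and that a match requires both the cleaned name and the formula to agree. The function names, record layout, and identifiers such as `meta-1` and `kegg-1` are hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches HTML tags such as <i> or </sub>


def normalize_name(name: str) -> str:
    """Strip HTML tags, decode entities, lower-case, and collapse whitespace."""
    text = html.unescape(TAG_RE.sub("", name))
    return " ".join(text.lower().split())


def parse_lisp_formula(formula: str) -> dict:
    """Parse an assumed Lisp-style formula, e.g. '((C 6) (H 12) (O 6))'."""
    pairs = re.findall(r"\(\s*([A-Za-z]+)\s+(\d+)\s*\)", formula)
    return {element.capitalize(): int(count) for element, count in pairs}


def parse_flat_formula(formula: str) -> dict:
    """Parse a continuous formula string, e.g. 'C6H12O6' (missing count means 1)."""
    pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    return {element: int(count) if count else 1 for element, count in pairs}


def exact_matches(metacyc: dict, kegg: dict) -> dict:
    """Map MetaCyc ids to KEGG ids where cleaned name and formula both agree.

    Both inputs map an id to a (name, formula) pair; this layout is an assumption.
    """
    index = {}
    for kegg_id, (name, formula) in kegg.items():
        key = (normalize_name(name), frozenset(parse_flat_formula(formula).items()))
        index.setdefault(key, kegg_id)  # keep the first KEGG entry per key

    matched = {}
    for meta_id, (name, formula) in metacyc.items():
        key = (normalize_name(name), frozenset(parse_lisp_formula(formula).items()))
        if key in index:
            matched[meta_id] = index[key]
    return matched


if __name__ == "__main__":
    metacyc = {
        "meta-1": ("<i>N</i>-acetyl-D-glucosamine", "((C 8) (H 15) (N 1) (O 6))"),
        "meta-2": ("unlisted compound", "((C 1))"),
    }
    kegg = {"kegg-1": ("N-Acetyl-D-glucosamine", "C8H15NO6")}
    print(exact_matches(metacyc, kegg))
```

Comparing formulae as element-to-count mappings rather than as raw strings avoids depending on the order in which either database lists its elements.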

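For substrates left unmatched by exact comparison, the abstract reports using agrep and then having a biochemist curate the results. The sketch below does not call agrep. As a stand-in, it ranks candidate names by plain Levenshtein edit distance over whole names, which only approximates agrep's bounded-error matching. The function names and the `max_errors` parameter are hypothetical, and the candidates it returns would still need manual review.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def approximate_candidates(name: str, kegg_names: list, max_errors: int = 2) -> list:
    """Return (distance, kegg_name) pairs within max_errors edits, closest first."""
    query = name.lower()
    scored = [(edit_distance(query, candidate.lower()), candidate)
              for candidate in kegg_names]
    return sorted(pair for pair in scored if pair[0] <= max_errors)


if __name__ == "__main__":
    # Candidates are suggestions only; a curator decides which are correct.
    print(approximate_candidates("D-glucose", ["D-Glucose", "L-glucose", "sucrose"]))
```

The second candidate in this example shows why curation matters: a small edit distance can separate two chemically distinct substrates.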
