The Search for a Cost Matrix to Solve Rare-Class Biological Problems

dc.contributor.author: Lawson, Mark Jon
dc.contributor.committeechair: Zhang, Liqing
dc.contributor.committeemember: Heath, Lenwood S.
dc.contributor.committeemember: Ramakrishnan, Naren
dc.contributor.committeemember: Fan, Weiguo Patrick
dc.contributor.committeemember: Wang, Gang Alan
dc.contributor.department: Computer Science
dc.date.accessioned: 2014-03-14T20:18:39Z
dc.date.adate: 2009-12-10
dc.date.available: 2014-03-14T20:18:39Z
dc.date.issued: 2009-11-02
dc.date.rdate: 2009-12-10
dc.date.sdate: 2009-11-16
dc.description.abstract: The rare-class data classification problem is a common one. It occurs when, in a dataset, the class of interest is far outnumbered by the other classes, making it difficult to classify with typical classification algorithms. Problems of this type arise often in biological datasets, where data can be sparse and the class of interest has few representatives. A variety of solutions to this problem exist, with varying degrees of success. In this paper, we present our solution to the rare-class problem. This solution uses MetaCost, a cost-sensitive meta-classifier that takes in a classification algorithm, training data, and a cost matrix. The cost matrix adjusts the learning of the classification algorithm so that it classifies more of the rare-class data, but the right matrix is generally unknown for a given dataset and classifier. Our method uses three different optimization techniques (greedy search, simulated annealing, and a genetic algorithm) to search for the optimal cost matrix. In this paper we show how this method can improve classification on a large number of datasets, achieving better results across a variety of metrics. We show how it can improve on different classification algorithms, and do so better and more consistently than other rare-class learning techniques such as oversampling and undersampling. Overall, our method is a robust and effective solution to the rare-class problem.
dc.description.degree: Ph. D.
dc.identifier.other: etd-11162009-144820
dc.identifier.sourceurl: http://scholar.lib.vt.edu/theses/available/etd-11162009-144820/
dc.identifier.uri: http://hdl.handle.net/10919/29623
dc.publisher: Virginia Tech
dc.relation.haspart: Lawson_MJ_D_2009.pdf
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Local Search
dc.subject: Bioinformatics
dc.subject: Machine learning
dc.subject: Classification
dc.title: The Search for a Cost Matrix to Solve Rare-Class Biological Problems
dc.type: Dissertation
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: doctoral
thesis.degree.name: Ph. D.
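
The abstract describes a cost matrix that shifts a classifier toward the rare class, and a search for a good matrix. The following is a minimal sketch of that idea only, not the dissertation's implementation. It assumes a two-class problem, a cost matrix where cost[i][j] is the cost of predicting class i when the true class is j, and class-probability estimates that are already available. The function names, the F1 objective, the step size, and the toy data are all hypothetical choices made for illustration.

```python
def min_cost_class(probs, cost):
    """Return the class i that minimises sum_j probs[j] * cost[i][j].

    cost[i][j] is the cost of predicting class i when the true class is j.
    Ties go to the lowest class index.
    """
    n = len(cost)
    risks = [sum(probs[j] * cost[i][j] for j in range(n)) for i in range(n)]
    return min(range(n), key=risks.__getitem__)


def f1_rare(preds, labels, rare=1):
    """F1 score for the rare class (0.0 when there are no true positives)."""
    tp = sum(p == rare and y == rare for p, y in zip(preds, labels))
    fp = sum(p == rare and y != rare for p, y in zip(preds, labels))
    fn = sum(p != rare and y == rare for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def greedy_cost_search(prob_rows, labels, step=1.0, max_cost=50.0):
    """Hypothetical greedy search: raise the cost of missing a rare
    example while the rare-class F1 keeps strictly improving."""

    def score(fn_cost):
        # Row 0: predict common; row 1: predict rare.
        cost = [[0.0, fn_cost], [1.0, 0.0]]
        preds = [min_cost_class(p, cost) for p in prob_rows]
        return f1_rare(preds, labels)

    fn_cost = 1.0
    best = score(fn_cost)
    while fn_cost + step <= max_cost:
        candidate = score(fn_cost + step)
        if candidate <= best:
            break  # no improvement: stop at the current cost
        fn_cost, best = fn_cost + step, candidate
    return fn_cost, best


# Toy data (made up): each row is [P(common), P(rare)]; label 1 is rare.
prob_rows = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
             [0.6, 0.4], [0.95, 0.05], [0.85, 0.15]]
labels = [0, 0, 1, 1, 0, 0]
fn_cost, best_f1 = greedy_cost_search(prob_rows, labels)
print(fn_cost, best_f1)
```

With equal costs, every toy example is assigned to the common class; raising the cost of a missed rare example moves the decision boundary until both rare examples are recovered. Simulated annealing or a genetic algorithm would replace only the search loop, leaving the scoring function unchanged.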

Files

Original bundle
Name: Lawson_MJ_D_2009.pdf
Size: 968.76 KB
Format: Adobe Portable Document Format