dc.contributor.author: Lawson, Mark Jon [en_US]
dc.date.accessioned: 2014-03-14T20:18:39Z
dc.date.available: 2014-03-14T20:18:39Z
dc.date.issued: 2009-11-02 [en_US]
dc.identifier.other: etd-11162009-144820 [en_US]
dc.identifier.uri: http://hdl.handle.net/10919/29623
dc.description.abstract: The rare-class data classification problem is a common one. It occurs when the class of interest in a dataset is far outnumbered by the other classes, which makes it difficult to classify using typical classification algorithms. These types of problems are found quite often in biological datasets, where data can be sparse and the class of interest has few representatives. A variety of solutions to this problem exist, with varying degrees of success. In this paper, we present our solution to the rare-class problem. This solution uses MetaCost, a cost-sensitive meta-classifier that takes a classification algorithm, training data, and a cost matrix as input. The cost matrix adjusts the learning of the classification algorithm so that it classifies more of the rare-class data, but it is generally unknown for a given dataset and classifier. Our method uses three different types of optimization techniques (greedy, simulated annealing, genetic algorithm) to determine an optimal cost matrix. In this paper we show how this method can improve classification on a large number of datasets, achieving better results across a variety of metrics. We show how it can improve on different classification algorithms, and do so better and more consistently than other rare-class learning techniques such as oversampling and undersampling. Overall, our method is a robust and effective solution to the rare-class problem. [en_US]
dc.publisher: Virginia Tech [en_US]
dc.relation.haspart: Lawson_MJ_D_2009.pdf [en_US]
dc.rights: I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to Virginia Tech or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. [en_US]
dc.subject: Local Search [en_US]
dc.subject: Bioinformatics [en_US]
dc.subject: Machine Learning [en_US]
dc.subject: Classification [en_US]
dc.title: The Search for a Cost Matrix to Solve Rare-Class Biological Problems [en_US]
dc.type: Dissertation [en_US]
dc.contributor.department: Computer Science [en_US]
dc.description.degree: Ph. D. [en_US]
thesis.degree.name: Ph. D. [en_US]
thesis.degree.level: doctoral [en_US]
thesis.degree.grantor: Virginia Polytechnic Institute and State University [en_US]
thesis.degree.discipline: Computer Science [en_US]
dc.contributor.committeechair: Zhang, Liqing [en_US]
dc.contributor.committeemember: Heath, Lenwood S. [en_US]
dc.contributor.committeemember: Ramakrishnan, Naren [en_US]
dc.contributor.committeemember: Fan, Weiguo Patrick [en_US]
dc.contributor.committeemember: Wang, G. Alan [en_US]
dc.identifier.sourceurl: http://scholar.lib.vt.edu/theses/available/etd-11162009-144820/ [en_US]
dc.date.sdate: 2009-11-16 [en_US]
dc.date.rdate: 2009-12-10
dc.date.adate: 2009-12-10 [en_US]

