A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections

dc.contributor.author: Chen, Yuxin (en)
dc.contributor.committeechair: Fox, Edward A. (en)
dc.contributor.committeemember: Lu, Chang-Tien (en)
dc.contributor.committeemember: Fan, Weiguo Patrick (en)
dc.contributor.committeemember: Ramakrishnan, Naren (en)
dc.contributor.committeemember: Torres, Ricardo da Silva (en)
dc.contributor.department: Computer Science (en)
dc.date.accessioned: 2014-03-14T20:07:33Z (en)
dc.date.adate: 2007-03-28 (en)
dc.date.available: 2014-03-14T20:07:33Z (en)
dc.date.issued: 2007-02-05 (en)
dc.date.rdate: 2007-03-28 (en)
dc.date.sdate: 2007-02-16 (en)
dc.description.abstract: The Web, containing a large amount of useful information and resources, is expanding rapidly. Collecting domain-specific documents/information from the Web is one of the most important methods to build digital libraries for the scientific community. Focused crawlers can selectively retrieve Web documents relevant to a specific domain to build collections for domain-specific search engines or digital libraries. Traditional focused crawlers, which normally adopt the simple Vector Space Model and local Web search algorithms, typically find relevant Web pages only with low precision. Recall is also often low, since they explore a limited sub-graph of the Web that surrounds the starting URL set and ignore relevant pages outside this sub-graph. In this work, we investigated how to apply an inductive machine learning algorithm and a meta-search technique to the traditional focused crawling process to overcome the above-mentioned problems and improve performance. We proposed a novel hybrid focused crawling framework based on Genetic Programming (GP) and meta-search. We showed that our novel hybrid framework can be applied to traditional focused crawlers to accurately find more relevant Web documents for the use of digital libraries and domain-specific search engines. The framework is validated through experiments performed on test documents from the Open Directory Project. Our studies have shown that improvement can be achieved relative to the traditional focused crawler if genetic programming and meta-search methods are introduced into the focused crawling process. (en)
dc.description.degree: Ph. D. (en)
dc.identifier.other: etd-02162007-005107 (en)
dc.identifier.sourceurl: http://scholar.lib.vt.edu/theses/available/etd-02162007-005107/ (en)
dc.identifier.uri: http://hdl.handle.net/10919/26220 (en)
dc.publisher: Virginia Tech (en)
dc.relation.haspart: YuxinDissertation_etd_final1.pdf (en)
dc.rights: In Copyright (en)
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ (en)
dc.subject: meta-search (en)
dc.subject: digital libraries (en)
dc.subject: focused crawler (en)
dc.subject: classification (en)
dc.title: A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections (en)
dc.type: Dissertation (en)
thesis.degree.discipline: Computer Science (en)
thesis.degree.grantor: Virginia Polytechnic Institute and State University (en)
thesis.degree.level: doctoral (en)
thesis.degree.name: Ph. D. (en)

Files

Original bundle
Name: YuxinDissertation_etd_final1.pdf
Size: 801.01 KB
Format: Adobe Portable Document Format