A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections

Chen, Yuxin

Files

YuxinDissertation_etd_final1.pdf (801.01 KB)

Date

2007-02-05

Authors

Chen, Yuxin

Publisher

Virginia Tech

Abstract

The Web, which contains a large amount of useful information and resources, is expanding rapidly. Collecting domain-specific documents and information from the Web is one of the most important methods of building digital libraries for the scientific community. Focused crawlers can selectively retrieve Web documents relevant to a specific domain to build collections for domain-specific search engines or digital libraries. Traditional focused crawlers, which normally adopt the simple Vector Space Model and local Web search algorithms, typically find relevant Web pages with low precision. Recall is also often low, because they explore only a limited sub-graph of the Web surrounding the starting URL set and ignore relevant pages outside this sub-graph. In this work, we investigated how to apply an inductive machine learning algorithm and a meta-search technique to the traditional focused crawling process to overcome these problems and improve performance. We proposed a novel hybrid focused crawling framework based on Genetic Programming (GP) and meta-search. We showed that this framework can be applied to traditional focused crawlers to find more relevant Web documents accurately for use in digital libraries and domain-specific search engines. The framework is validated through experiments performed on test documents from the Open Directory Project. Our studies have shown that improvement over the traditional focused crawler can be achieved when genetic programming and meta-search methods are introduced into the focused crawling process.
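
As a rough illustration of the baseline the abstract criticizes, the following minimal sketch scores a page against a topic description with a plain Vector Space Model, using raw term frequencies and cosine similarity. It is not the dissertation's GP-based method; the function names, the tokenizer, and the 0.3 threshold are assumptions chosen only for illustration.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Lowercase the text, keep alphabetic tokens, and count term frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def is_relevant(page_text, topic_text, threshold=0.3):
    """Hypothetical relevance test: threshold is an arbitrary assumption."""
    score = cosine_similarity(tf_vector(page_text), tf_vector(topic_text))
    return score >= threshold
```

A crawler built on such a test would fetch a page, call `is_relevant` on its text, and follow its outgoing links only when the page passes. Because the score depends only on term overlap with a fixed topic description, this kind of classifier is the component the abstract proposes to replace with a learned similarity function.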

Keywords

meta-search, digital libraries, focused crawler, classification

Persistent link

http://hdl.handle.net/10919/26220

Collections

Doctoral Dissertations
