Crawling on the World Wide Web

Department of Computer Science, Virginia Polytechnic Institute & State University


As the World Wide Web grows rapidly, people need web search engines to search through it. The crawler is an important module of a web search engine, and its quality directly affects the search quality of the engine. Given some seed URLs, the crawler retrieves the web pages at those URLs, parses the HTML files, adds newly found URLs to its buffer, and returns to the first phase of this cycle. While parsing an HTML file for new URLs, the crawler can also extract other information from it. This paper describes the design, implementation, and some considerations of a new crawler, programmed as a learning exercise and for possible use in experimental studies.
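The crawl cycle described above can be sketched as a breadth-first loop. This is not the paper's actual implementation, only a minimal illustration using the Python standard library; the names `LinkParser`, `crawl`, and the `max_pages` limit are assumptions introduced here for the sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href targets of anchor tags while parsing an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: fetch a page, parse out new URLs, enqueue them."""
    frontier = deque(seed_urls)   # buffer of URLs waiting to be fetched
    seen = set(seed_urls)         # avoid fetching the same URL twice
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable or malformed pages
        fetched.append(url)
        parser = LinkParser(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched
```

A production crawler would additionally respect robots.txt, rate-limit requests per host, and persist its frontier, but the fetch-parse-enqueue cycle is the core loop the abstract refers to.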



Information retrieval