Focused Crawling

Abstract

Finding information on the WWW is a difficult and challenging task because of the extremely large volume of content it contains. Search engines facilitate this task, but it is still difficult to cover all the webpages on the WWW and to provide good results for all types of users and in all contexts. The concept of focused crawling was developed to overcome these difficulties. There are several approaches to building a focused crawler. Classification-based approaches use classifiers for relevance estimation. Semantic-based approaches use ontologies for domain or topic representation and for relevance estimation. Link analysis approaches use both text and link structure information for relevance estimation. The main differences between these approaches lie in the crawling policy, how the topic of interest is represented, and how the relevance of webpages visited during crawling is estimated. In this report we present a modular architecture for focused crawling. We separated the main components of focused crawling into modules to facilitate the exchange and integration of different implementations. We present a classification-based focused crawler prototype based on our modular architecture, and we describe how it can help with a particular event-oriented crawl. Note: Mr. Collins and Mr. Dickerson, in CS4624 in the spring of 2013, extended the prior work by the other co-authors from CS5604 in the fall of 2012.

Description

* FocusedCrawler.py: Driver class for this project; responsible for creating the configuration and classifier objects and invoking the crawler.
* crawler.py: Crawler class responsible for collecting and exploring new URLs to find relevant pages; it is given a priority queue and a scoring class with a calculate_score(text) method (see the crawl-loop sketch after this list).
* classifier.py: Parent class of the (non-VSM) classifiers, including NaiveBayesClassifier and SVMClassifier; contains the code for tokenization and vectorization of document text using sklearn, so child classes only have to assign self.model (see the classifier sketch after this list).
* config.ini: Configuration file for the focused crawler, in INI format.
* fcconfig.py: Class responsible for reading the configuration file using ConfigParser; adds all configuration options to its internal dictionary (e.g., config["seedFile"]) (sketched below).
* fcutils.py: Various utility functions for reading files and for sanitizing/tokenizing text.
* html_files.txt: List of local files that act as the training/testing set for the classifier ("repository docs"); this default name can be changed in the configuration.
* labels.txt: Optional; numerical category labels (1 for relevant, 0 for non-relevant) in one-to-one correspondence with the lines of the repository file.
* lsiscorer.py: Subclass of Scorer representing an LSI vector space model.
* NBClassifier.py: Subclass of Classifier representing a Naïve Bayes classifier.
* priorityQueue.py: Simple implementation of a priority queue using a heap.
* scorer.py: Parent class of scorers, i.e., non-classifier models (typically VSMs).
* seeds.txt: URLs of relevant pages from which the focused crawler starts; this default name can be modified in config.ini.
* SVMClassifier.py: Subclass of Classifier representing an SVM classifier.
* tfidfscorer.py: Subclass of Scorer representing a tf-idf vector space model.
* webpage.py: Uses BeautifulSoup and NLTK to extract webpage text.
* README.txt: Documentation about the Focused Crawler and its usage.
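To make the interaction between crawler.py, priorityQueue.py, and a scorer concrete, the following is a minimal sketch of a best-first crawl loop. Only the calculate_score(text) interface is taken from the description above; the function name crawl, its parameters page_limit and min_score, and the direct use of heapq, requests, and BeautifulSoup are simplifying assumptions, not the project's actual implementation.

import heapq
import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, scorer, page_limit=100, min_score=0.5):
    """Best-first crawl: always expand the highest-scoring URL seen so far."""
    frontier = [(-1.0, url) for url in seed_urls]    # heapq is a min-heap, so scores are negated
    heapq.heapify(frontier)
    visited, relevant = set(), []

    while frontier and len(visited) < page_limit:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(separator=" ")          # webpage.py does this extraction with BeautifulSoup/NLTK
        score = scorer.calculate_score(text)         # relevance estimate from the scoring class
        if score >= min_score:
            relevant.append(url)
            # Enqueue outlinks, using the parent page's score as their priority estimate.
            for link in soup.find_all("a", href=True):
                href = link["href"]
                if href.startswith("http") and href not in visited:
                    heapq.heappush(frontier, (-score, href))
    return relevant

If the classifiers expose the same calculate_score(text) interface as the scorers (as assumed in the next sketch), this loop can be paired with either family of models without modification.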
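The division of labor in classifier.py, where the parent class handles tokenization and vectorization with sklearn and a child class only assigns self.model, could look roughly like the following. The train method name, the TfidfVectorizer and MultinomialNB choices, and the calculate_score wrapper are assumptions for illustration; only the sklearn vectorization and the self.model convention are stated above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

class Classifier:
    """Parent class: vectorization lives here; subclasses only set self.model."""
    def __init__(self):
        self.vectorizer = TfidfVectorizer(stop_words="english")
        self.model = None                                # assigned by NBClassifier, SVMClassifier, ...

    def train(self, docs, labels):
        X = self.vectorizer.fit_transform(docs)          # docs: page texts listed in html_files.txt
        self.model.fit(X, labels)                        # labels: 1 = relevant, 0 = non-relevant (labels.txt)

    def calculate_score(self, text):
        X = self.vectorizer.transform([text])
        return float(self.model.predict_proba(X)[0, 1])  # probability of the "relevant" class

class NBClassifier(Classifier):
    """Naive Bayes child class: all it has to do is assign self.model."""
    def __init__(self):
        super().__init__()
        self.model = MultinomialNB()

An SVMClassifier would follow the same pattern, assigning an sklearn SVM that supports probability estimates (e.g., sklearn.svm.SVC(probability=True)).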
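Finally, the role of fcconfig.py (reading config.ini with ConfigParser and exposing options through dictionary-style access such as config["seedFile"]) can be sketched as below. The class name FCConfig and the flattening of all sections into one dictionary are assumptions; only ConfigParser, config.ini, and the seeds.txt default come from the description.

from configparser import ConfigParser

class FCConfig:
    """Reads config.ini and exposes options dictionary-style, e.g. config["seedFile"]."""
    def __init__(self, path="config.ini"):
        parser = ConfigParser()
        parser.optionxform = str                            # preserve key case, e.g. "seedFile"
        parser.read(path)
        self._options = {}
        for section in parser.sections():
            self._options.update(parser.items(section))     # flatten all sections into one dictionary
        self._options.setdefault("seedFile", "seeds.txt")   # default seed file name

    def __getitem__(self, key):
        return self._options[key]

Under these assumptions, the driver could then do config = FCConfig() and read the seed URLs from the file named by config["seedFile"].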

Keywords

Information Retrieval, Web Crawling, Web Crawler

Citation