Focused Crawling

Farag, Mohamed Magdy Gharib; Khan, Mohammed Saquib Akmal; Mishra, Gaurav; Ganesh, Prasad Krishnamurthi; Collins, Wil; Dickerson, Will

dc.contributor.author	Farag, Mohamed Magdy Gharib	en
dc.contributor.author	Khan, Mohammed Saquib Akmal	en
dc.contributor.author	Mishra, Gaurav	en
dc.contributor.author	Ganesh, Prasad Krishnamurthi	en
dc.contributor.author	Collins, Wil	en
dc.contributor.author	Dickerson, Will	en
dc.date.accessioned	2013-10-01T13:42:18Z	en
dc.date.available	2013-10-01T13:42:18Z	en
dc.date.issued	2012-12-11	en
dc.description	* FocusedCrawler.py: driver class for this project, responsible for creating the configuration and classifier objects and calling the crawler; * crawler.py: crawler class responsible for collecting and exploring new URLs to find relevant pages, given a priority queue and a scoring class with a calculate_score(text) method; * classifier.py: parent class of the (non-VSM) classifiers, including NaiveBayesClassifier and SVMClassifier, containing code for tokenization and vectorization of document text using sklearn, so that child classes only have to assign self.model; * config.ini: configuration file for the focused crawler, in INI format; * fcconfig.py: class responsible for reading the configuration file using ConfigParser, which adds all configuration options to its internal dictionary (e.g. config["seedFile"]); * fcutils.py: various utility functions for reading files and sanitizing/tokenizing text; * html_files.txt: list of local files that act as the training/testing set for the classifier (“repository docs”), a default name that can be changed in the configuration; * labels.txt: optional file of numerical category labels (1 for relevant, 0 for nonrelevant), in 1-to-1 correspondence with the lines of the repository; * lsiscorer.py: subclass of Scorer representing an LSI vector space model; * NBClassifier.py: subclass of Classifier representing a Naïve Bayes classifier; * priorityQueue.py: simple implementation of a priority queue using a heap; * scorer.py: parent class of scorers, which are non-classifier models (typically VSM); * seeds.txt: URLs of relevant pages from which the focused crawler starts, a default name that can be changed in config.ini; * SVMClassifier.py: subclass of Classifier representing an SVM classifier; * tfidfscorer.py: subclass of Scorer representing a tf-idf vector space model; * webpage.py: uses BeautifulSoup and NLTK to extract webpage text; * README.txt: documentation about the Focused Crawler and its usage.	en
dc.description.abstract	Finding information on the WWW is difficult because of its extremely large volume of content. Search engines facilitate this task, but it is still difficult to cover all the webpages on the WWW and to provide good results for all types of users and in all contexts. The focused crawling concept was developed to overcome these difficulties. There are several approaches to developing a focused crawler. Classification-based approaches use classifiers in relevance estimation. Semantic-based approaches use ontologies for domain or topic representation and in relevance estimation. Link analysis approaches use text and link structure information in relevance estimation. The main differences between these approaches are the crawling policy adopted, how the topic of interest is represented, and how the relevance of webpages visited during crawling is estimated. In this report we present a modular architecture for focused crawling. We separated the design of the main components of focused crawling into modules to facilitate the exchange and integration of different modules. We present a classification-based focused crawler prototype based on this modular architecture, and describe how it can help with a particular event-oriented crawl. Note: Mr. Collins and Mr. Dickerson, in CS4624 in the spring of 2013, extended the prior work by the other co-authors from CS5604, from the fall of 2012.	en
dc.description.sponsorship	NSF IIS-0916733 and IIS-1319578	en
dc.identifier.uri	http://hdl.handle.net/10919/23856	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.relation	http://hdl.handle.net/10919/19085	en
dc.rights	Creative Commons Attribution-NonCommercial 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/us/	en
dc.subject	Information Retrieval	en
dc.subject	Web Crawling	en
dc.subject	Web Crawler	en
dc.title	Focused Crawling	en
dc.type	Technical report	en
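
The description field above mentions a crawler that is given a priority queue and a scoring class with a calculate_score(text) method. The following is a minimal sketch of how such a best-first crawl loop might fit together; it is not the project's actual code. Only the calculate_score(text) method name comes from the record. KeywordScorer, PriorityQueue, crawl, the threshold and page_limit parameters, and the in-memory TOY_WEB (used instead of real HTTP fetching) are all hypothetical names chosen for illustration.

```python
import heapq


class KeywordScorer:
    """Hypothetical stand-in scorer: fraction of tokens that are topic keywords."""

    def __init__(self, keywords):
        self.keywords = {k.lower() for k in keywords}

    def calculate_score(self, text):
        tokens = text.lower().split()
        if not tokens:
            return 0.0
        hits = sum(1 for t in tokens if t in self.keywords)
        return hits / len(tokens)


class PriorityQueue:
    """Max-priority queue on top of heapq (a min-heap), so scores are negated."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, score, url):
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return -neg_score, url

    def __len__(self):
        return len(self._heap)


def crawl(seeds, fetch, scorer, threshold=0.1, page_limit=10):
    """Best-first crawl: visit the highest-priority URL next and
    follow links only from pages scored as relevant."""
    queue = PriorityQueue()
    for url in seeds:
        queue.push(1.0, url)  # assumption: seeds are treated as fully relevant
    visited = set()
    relevant = []
    while len(queue) and len(visited) < page_limit:
        _, url = queue.pop()
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if page is None:  # unknown or unreachable URL
            continue
        text, links = page
        score = scorer.calculate_score(text)
        if score >= threshold:
            relevant.append((url, score))
            for link in links:
                if link not in visited:
                    # assumption: a link inherits its parent page's score
                    queue.push(score, link)
    return relevant


# Toy in-memory "web": url -> (page text, outgoing links).
TOY_WEB = {
    "a": ("hurricane flood damage report", ["b", "c"]),
    "b": ("flood relief shelter hurricane", ["d"]),
    "c": ("celebrity gossip fashion news", ["e"]),
    "d": ("hurricane evacuation route", []),
    "e": ("hurricane hurricane hurricane", []),
}

scorer = KeywordScorer(["hurricane", "flood"])
results = crawl(["a"], TOY_WEB.get, scorer)
print(results)
```

In this toy run, page "c" scores 0 and is not expanded, so page "e" is never reached even though it is on topic. That is the usual trade-off of pruning on the parent page's relevance; a classifier or VSM scorer could be swapped in for KeywordScorer as long as it exposes calculate_score(text).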