Focused Crawling

dc.contributor.author: Farag, Mohamed Magdy Gharib
dc.contributor.author: Khan, Mohammed Saquib Akmal
dc.contributor.author: Mishra, Gaurav
dc.contributor.author: Ganesh, Prasad Krishnamurthi
dc.contributor.author: Collins, Wil
dc.contributor.author: Dickerson, Will
dc.date.accessioned: 2013-10-01T13:42:18Z
dc.date.available: 2013-10-01T13:42:18Z
dc.date.issued: 2012-12-11
dc.description:
* FocusedCrawler.py: Driver class for this project. Creates the configuration and classifier objects and calls the crawler.
* crawler.py: Crawler class that collects and explores new URLs to find relevant pages, given a priority queue and a scoring class with a calculate_score(text) method.
* classifier.py: Parent class of the (non-VSM) classifiers, including NaiveBayesClassifier and SVMClassifier. Contains the code for tokenizing and vectorizing document text using sklearn. Child classes only have to assign self.model.
* config.ini: Configuration file for the focused crawler, in INI format.
* fcconfig.py: Class that reads the configuration file using ConfigParser. Adds all configuration options to its internal dictionary (e.g. config["seedFile"]).
* fcutils.py: Utility functions for reading files and sanitizing/tokenizing text.
* html_files.txt: List of local files that act as the training/testing set for the classifier (“repository docs”). Default name; can be changed in the configuration.
* labels.txt: Numerical category labels in one-to-one correspondence with the lines of the repository: 1 for relevant, 0 for nonrelevant. Optional.
* lsiscorer.py: Subclass of Scorer representing an LSI vector space model.
* NBClassifier.py: Subclass of Classifier representing a Naïve Bayes classifier.
* priorityQueue.py: Simple implementation of a priority queue using a heap.
* scorer.py: Parent class of scorers, which are non-classifier models (typically VSM).
* seeds.txt: URLs of relevant pages from which the focused crawler starts. Default name; can be changed in config.ini.
* SVMClassifier.py: Subclass of Classifier representing an SVM classifier.
* tfidfscorer.py: Subclass of Scorer representing a tf-idf vector space model.
* webpage.py: Uses BeautifulSoup and NLTK to extract webpage text.
* README.txt: Documentation about the Focused Crawler and its usage.
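
The file list above describes a crawler driven by a heap-based priority queue and a scoring object that exposes calculate_score(text). The following is a minimal sketch of how such pieces could fit together, not the project's actual code. Only the calculate_score(text) method name comes from the description; the class names PriorityQueue and KeywordScorer, the crawl function, its threshold and page_limit parameters, the rule that a link inherits its parent page's score, and the in-memory PAGES dictionary that stands in for real HTTP fetching are all assumptions made for illustration.

```python
import heapq
from itertools import count


class PriorityQueue:
    """Max-priority queue built on heapq (a min-heap), so scores are negated."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: equal scores pop in insertion order

    def push(self, score, item):
        heapq.heappush(self._heap, (-score, next(self._order), item))

    def pop(self):
        neg_score, _, item = heapq.heappop(self._heap)
        return -neg_score, item

    def __len__(self):
        return len(self._heap)


class KeywordScorer:
    """Hypothetical scorer: fraction of topic keywords that occur in the text."""

    def __init__(self, keywords):
        self.keywords = {k.lower() for k in keywords}

    def calculate_score(self, text):
        words = set(text.lower().split())
        return len(self.keywords & words) / len(self.keywords)


def crawl(seeds, fetch, scorer, threshold=0.5, page_limit=10):
    """Visit pages best-first; follow links only from pages judged relevant."""
    queue = PriorityQueue()
    seen = set(seeds)
    for url in seeds:
        queue.push(1.0, url)  # assumption: seed pages are treated as relevant
    relevant = []
    visited = 0
    while len(queue) and visited < page_limit:
        _, url = queue.pop()
        visited += 1
        text, links = fetch(url)
        score = scorer.calculate_score(text)
        if score < threshold:
            continue  # irrelevant page: do not expand its links
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.push(score, link)  # assumption: link inherits parent score
    return relevant


# Toy in-memory "web" (hypothetical data): url -> (page text, outgoing links).
PAGES = {
    "a": ("boston marathon bombing news", ["b", "c"]),
    "b": ("cooking recipes and tips", ["d"]),
    "c": ("boston bombing investigation update", ["d"]),
    "d": ("marathon bombing suspects in boston", []),
}

result = crawl(["a"], lambda url: PAGES[url], KeywordScorer(["boston", "bombing"]))
print(result)  # ['a', 'c', 'd']
```

Because the crawl function only relies on calculate_score(text), a classifier-backed or vector-space scorer could be swapped in without changing the loop.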
dc.description.abstract: Finding information on the WWW is a difficult and challenging task because of the extremely large volume of content in the WWW. Search engines can be used to facilitate this task, but it is still difficult to cover all the webpages on the WWW and also to provide good results for all types of users and in all contexts. The focused crawling concept has been developed to overcome these difficulties. There are several approaches for developing a focused crawler. Classification-based approaches use classifiers in relevance estimation. Semantic-based approaches use ontologies for domain or topic representation and in relevance estimation. Link analysis approaches use text and link structure information in relevance estimation. The main differences between these approaches are: what policy is taken for crawling, how to represent the topic of interest, and how to estimate the relevance of webpages visited during crawling. We present in this report a modular architecture for focused crawling. We separated the design of the main components of focused crawling into modules to facilitate the exchange and integration of different modules. We present here a classification-based focused crawler prototype based on our modular architecture. We also describe how it can help with a particular event-oriented crawl. Note: Mr. Collins and Mr. Dickerson, in CS4624 in the spring of 2013, extended the prior work by the other co-authors from CS5604, from the fall of 2012.
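
The abstract singles out two design questions: how to represent the topic of interest and how to estimate the relevance of a visited page. One common answer, shown here only as a sketch under stated assumptions, is to represent the topic as a tf-idf vector built from a few known-relevant documents and to score a page by cosine similarity to that vector. The class name TfidfTopicScorer, the whitespace tokenizer, the smoothed idf formula, and the sample texts are all illustrative choices, not taken from the report; the prototype itself is described as using sklearn, NLTK, and BeautifulSoup, none of which are used in this standard-library version.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase, split on whitespace, keep alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


class TfidfTopicScorer:
    """Hypothetical scorer: cosine similarity between a page and a topic vector."""

    def __init__(self, topic_docs):
        docs = [tokenize(d) for d in topic_docs]
        n = len(docs)
        doc_freq = Counter(term for doc in docs for term in set(doc))
        # Smoothed idf, so terms found in every topic document keep some weight.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
        # Terms never seen in the topic documents get the largest idf value.
        self.default_idf = math.log(1 + n) + 1.0
        # The topic is represented by the tf-idf vector of all topic text together.
        self.topic = self._vectorize([t for doc in docs for t in doc])

    def _vectorize(self, tokens):
        tf = Counter(tokens)
        return {t: f * self.idf.get(t, self.default_idf) for t, f in tf.items()}

    def calculate_score(self, text):
        page = self._vectorize(tokenize(text))
        dot = sum(w * self.topic.get(t, 0.0) for t, w in page.items())
        page_norm = math.sqrt(sum(w * w for w in page.values()))
        topic_norm = math.sqrt(sum(w * w for w in self.topic.values()))
        norm = page_norm * topic_norm
        return dot / norm if norm else 0.0


scorer = TfidfTopicScorer([
    "boston marathon bombing",
    "bombing suspects in boston",
])
on_topic = scorer.calculate_score("boston bombing investigation")
off_topic = scorer.calculate_score("cooking recipes and tips")
print(on_topic > off_topic)  # True
```

A classification-based variant would replace the cosine score with a classifier's predicted probability of relevance while keeping the same calculate_score(text) interface.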
dc.description.sponsorship: NSF IIS-0916733 and IIS-1319578
dc.identifier.uri: http://hdl.handle.net/10919/23856
dc.language.iso: en_US
dc.publisher: Virginia Tech
dc.relation: http://hdl.handle/10919/19085
dc.rights: Creative Commons Attribution-NonCommercial 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc/3.0/us/
dc.subject: Information Retrieval
dc.subject: Web Crawling
dc.subject: Web Crawler
dc.title: Focused Crawling
dc.type: Technical report

Files

Original bundle
Now showing 1 - 5 of 7
Name:
Focused Crawler Final Presentation.pdf
Size:
120.4 KB
Format:
Adobe Portable Document Format
Description:
Final Presentation for CS 4624 (PDF)
Name:
Focused Crawler Final Presentation.pptx
Size:
93.73 KB
Format:
Microsoft Powerpoint XML
Description:
Final Presentation for CS 4624 (PPT)
Name:
FocusedCrawlerReport.docx
Size:
260.29 KB
Format:
Microsoft Word XML
Description:
Final Technical Report (DOCX)
Name:
FocusedCrawlerReport.pdf
Size:
320.72 KB
Format:
Adobe Portable Document Format
Description:
Final Technical Report (PDF)
Name:
ctrnet.html
Size:
54.27 KB
Format:
Hypertext Markup Language
Description:
Boston bombings collection for CTRnet
License bundle
Now showing 1 - 1 of 1
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission