Focused Crawling

dc.contributor.author: Farag, Mohamed Magdy Gharib
dc.contributor.author: Khan, Mohammed Saquib Akmal
dc.contributor.author: Mishra, Gaurav
dc.contributor.author: Ganesh, Prasad Krishnamurthi
dc.date.accessioned: 2012-12-12T01:35:11Z
dc.date.available: 2012-12-12T01:35:11Z
dc.date.issued: 2012-12-11
dc.description: The Crisis, Tragedy, and Recovery network (CTRnet, http://www.ctrnet.net) project makes use of general-purpose crawlers, such as Heritrix (see the list of similar packages on p. 12 of 'Lucene in Action'). However, these crawlers are strongly influenced by the quality of the seeds used, as well as by other configuration details that govern the crawl. Focused crawlers typically use extra information, related to the topic of the crawl, to decide which links to follow from any page being examined. Thus, they may be able to reduce noise and increase precision, though this may reduce recall. Focused crawling about events is particularly challenging. This project aims to explore this problem, to design and implement a prototype that improves upon existing solutions, and to demonstrate its effectiveness with regard to CTRnet efforts.
dc.description.abstract: Finding information on the WWW is a difficult and challenging task because of the extremely large volume of the Web. Search engines can be used to facilitate this task, but it is still difficult to cover all of the webpages on the WWW and to provide good results for all types of users and in all contexts. The focused crawling concept was developed to overcome these difficulties. There are several approaches to developing a focused crawler. Classification-based approaches use classifiers for relevance estimation. Semantic-based approaches use ontologies for domain or topic representation and for relevance estimation. Link analysis approaches use text and link-structure information for relevance estimation. The main differences between these approaches are the policy taken for crawling, how the topic of interest is represented, and how the relevance of webpages visited during crawling is estimated. In this report we present a modular architecture for focused crawling. We separated the design of the main components of focused crawling into modules to facilitate the exchange and integration of different modules. We present a classification-based focused crawler prototype based on our modular architecture (an illustrative sketch of such a crawl loop follows the metadata fields below).
dc.identifier.uri: http://hdl.handle.net/10919/19085
dc.language.iso: en_US
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Focused Crawler
dc.subject: Crawler
dc.subject: Naive Bayes Classifier
dc.subject: Support Vector Machine Classifier
dc.title: Focused Crawling
dc.type: Technical report
dc.type: Working paper
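
The sketch below illustrates the kind of classification-based crawl loop the abstract describes: a topic classifier (Naive Bayes here, per the subject keywords; an SVM could be swapped in) scores each fetched page, and outlinks of relevant pages are pushed onto a priority frontier. This is a minimal, hypothetical Python sketch, not the report's actual implementation (that code is in Focused Crawler Project-code.zip); the training snippets, relevance threshold, regex-based link extraction, and use of scikit-learn are illustrative assumptions only.

# Minimal sketch of a classification-based focused crawler (hypothetical;
# not the report's code). Assumes scikit-learn is installed.
import heapq
import re
import urllib.request

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train a small topic classifier; the example texts are placeholders.
topic_docs = ["earthquake disaster recovery relief", "hurricane crisis response"]
other_docs = ["sports scores and league standings", "celebrity gossip and movies"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(topic_docs + other_docs)
y = [1] * len(topic_docs) + [0] * len(other_docs)
classifier = MultinomialNB().fit(X, y)

def relevance(text):
    """Estimated probability that a page's text belongs to the topic of interest."""
    return classifier.predict_proba(vectorizer.transform([text]))[0][1]

def crawl(seeds, max_pages=50, threshold=0.5):
    frontier = [(-1.0, url) for url in seeds]   # max-priority via negated score
    heapq.heapify(frontier)
    visited, results = set(), []
    while frontier and len(results) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue
        text = re.sub(r"<[^>]+>", " ", html)    # crude tag stripping
        score = relevance(text)
        if score >= threshold:
            results.append((url, score))
            # Follow outlinks only from relevant pages, prioritized by the
            # parent page's score (a simple relevance-propagation policy).
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in visited:
                    heapq.heappush(frontier, (-score, link))
    return results

if __name__ == "__main__":
    print(crawl(["http://www.ctrnet.net"]))     # CTRnet seed, as in the project description

Ordering the frontier by the parent page's relevance is only one possible crawl policy; the modular architecture described in the abstract would allow this scoring module to be exchanged for an SVM-based, semantic, or link-analysis variant.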

Files

Original bundle (5 files)

Name: Technical Report on FocusedCrawler_v2.0.pdf
Size: 543.17 KB
Format: Adobe Portable Document Format
Description: Technical Report: Focused Crawler

Name: Focused Crawler Project-code.zip
Size: 48.11 KB
Format: Unknown data format
Description: Focused Crawler - Implementation

Name: Technical Report on FocusedCrawler_v2.0.doc
Size: 340 KB
Format: Microsoft Word
Description: Technical Report on FocusedCrawler_v2.0.doc

Name: ProjFocusedCrawler-Dec04b.pdf
Size: 348.41 KB
Format: Adobe Portable Document Format
Description: ProjFocusedCrawler-Dec04b.pdf

Name: ProjFocusedCrawler-Dec04b.pptx
Size: 213.09 KB
Format: Microsoft Powerpoint XML
Description: ProjFocusedCrawler-Dec04b.pptx

License bundle (1 file)

Name: license.txt
Size: 1.5 KB
Description: Item-specific license agreed upon to submission