Developing an improved focused crawler for the IDEAL project

Abstract

The IDEAL (Integrated Digital Event Archive and Library) project currently has a general-purpose web crawler that finds articles relevant to a set of URLs the user provides. The resulting articles are returned based on frequency analysis of user-provided keywords. The goal of our project is to extend the web crawler to return articles related to user-provided events and other relevant information. By analyzing an article to identify key event components, such as the date, location, and type of natural disaster, we can construct a tree representation of each webpage. Next, we compute the tree edit distance between that tree and the event tree constructed from the user's original input. With this information we can predict webpage relevance with higher certainty than keyword-frequency analysis provides.
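
As an illustrative sketch of the tree-edit-distance comparison described above (not the project's actual implementation), two event trees can be compared with the Zhang-Shasha algorithm, here via the zss Python library; the tree shapes, labels, and example events below are hypothetical, not taken from the report.

# Sketch: compare a user-specified event tree against an event tree
# extracted from a webpage using Zhang-Shasha tree edit distance (zss).
# Tree structure and labels are illustrative assumptions only.
from zss import Node, simple_distance

# Event tree built from the user's query (event type, date, location).
query_tree = (
    Node("event")
    .addkid(Node("type").addkid(Node("hurricane")))
    .addkid(Node("date").addkid(Node("2012-10-29")))
    .addkid(Node("location").addkid(Node("New York")))
)

# Event tree extracted from a candidate webpage.
page_tree = (
    Node("event")
    .addkid(Node("type").addkid(Node("hurricane")))
    .addkid(Node("date").addkid(Node("2012-10-30")))
    .addkid(Node("location").addkid(Node("New Jersey")))
)

# A smaller edit distance means the page's event is closer to the
# user's event, which we treat as a proxy for relevance.
distance = simple_distance(query_tree, page_tree)
print("tree edit distance:", distance)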

Description
CS 4624 capstone project. The client is Mohamed Magdy Gharib Farag. Support was provided through NSF IIS-1319578: Integrated Digital Event Archiving and Library (IDEAL). The files provided include the final report, the midterm and final presentations, a poster presented at VTURCS, and related software. Our source code can be found at: https://github.com/wbonnefond/focused-crawler
Keywords
web crawler, IDEAL, Python, natural language processing, tree-edit distance
Citation