Developing an improved focused crawler for the IDEAL project
Abstract
The IDEAL (Integrated Digital Event Archive and Library) project currently has a general-purpose web crawler that finds articles relevant to a set of URLs that the user provides. The resulting articles are returned based on frequency analysis of user-provided keywords. The goal of our project is to extend the web crawler to return articles related to user-provided events and other relevant information. By analyzing an article to identify key event components, such as the date, location, and type of natural disaster, we can construct a tree representation of each webpage. Next, we compute the tree edit distance between that tree and the event tree constructed from the user’s original input. With this information, we can predict webpage relevance with higher certainty than keyword frequency analysis provides.
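To make the tree comparison concrete, the following is a minimal sketch of an ordered tree edit distance with unit costs for inserting, deleting, and relabeling a node. It is an illustration only, not the project's actual implementation: the tree layout (an `event` root with `type`, `location`, and `date` children), the helper names `node`, `size`, `forest_distance`, and `tree_edit_distance`, and the sample values are all assumptions made for this example. The recursion is the standard textbook forest decomposition on the rightmost root, memoized for brevity rather than speed.

```python
from functools import lru_cache


def node(label, *children):
    """Build an immutable tree node as (label, children). Hypothetical helper."""
    return (label, tuple(children))


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        return sum(size(t) for t in g)  # insert every node of g
    if not g:
        return sum(size(t) for t in f)  # delete every node of f
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + v_kids, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_kids) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_distance(v_kids, w_kids)
             + forest_distance(f[:-1], g[:-1])
             + (0 if v_label == w_label else 1))
    return min(delete, insert, match)


def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))


# Assumed example: the user's event tree and two candidate page trees.
event = node("event",
             node("type", node("hurricane")),
             node("location", node("Florida")),
             node("date", node("2012")))
page_other_location = node("event",
                           node("type", node("hurricane")),
                           node("location", node("Texas")),
                           node("date", node("2012")))
page_no_date = node("event",
                    node("type", node("hurricane")),
                    node("location", node("Florida")))

print(tree_edit_distance(event, event))                # 0
print(tree_edit_distance(event, page_other_location))  # 1
print(tree_edit_distance(event, page_no_date))         # 2
```

Under these assumptions, a smaller distance means the page's extracted event components are closer to the user's event, so a crawler could rank or filter pages by this value. How the distance is turned into a relevance decision (for example, a threshold) is not specified in the abstract and is left open here.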