Developing an improved focused crawler for the IDEAL project

dc.contributor.author: Bonnefond, Ward (en)
dc.contributor.author: Menzel, Chris (en)
dc.contributor.author: Morris, Zack (en)
dc.contributor.author: Patel, Suhas (en)
dc.contributor.author: Ritchie, Tyler (en)
dc.contributor.author: Tedesco, Marcus (en)
dc.contributor.author: Zheng, Franklin (en)
dc.date.accessioned: 2014-05-09T21:10:01Z (en)
dc.date.available: 2014-05-09T21:10:01Z (en)
dc.date.issued: 2014-05-09 (en)
dc.description: CS 4624 capstone project. The client is Mohamed Magdy Gharib Farag. Support was provided through NSF IIS - 1319578: Integrated Digital Event Archiving and Library (IDEAL). Files provided have the final report, midterm and final presentations, a poster presented at VTURCS, and related software. Our source code can be found at: https://github.com/wbonnefond/focused-crawler (en)
dc.description.abstract: The IDEAL (Integrated Digital Event Archiving and Library) project currently has a general-purpose web crawler that finds articles relevant to a set of URLs the user provides. The resulting articles are returned based on frequency analysis of user-provided keywords. The goal of our project is to extend the web crawler to return articles related to user-provided events and other relevant information. By analyzing an article to identify key event components, such as the date, location, and type of natural disaster, we can construct a tree representation of each webpage. Next, we compute the tree edit distance between that tree and the event tree constructed from the user’s original input. With this information we can predict webpage relevance with higher certainty than keyword frequency analysis provides. (en)
dc.description.sponsorship: NSF IIS - 1319578: Integrated Digital Event Archiving and Library (IDEAL). (en)
dc.description.sponsorship: Mohamed Magdy Gharib Farag (en)
dc.identifier.uri: http://hdl.handle.net/10919/47939 (en)
dc.language.iso: en_US (en)
dc.rights: Creative Commons Attribution-NonCommercial 3.0 United States (en)
dc.rights.uri: http://creativecommons.org/licenses/by-nc/3.0/us/ (en)
dc.subject: web crawler (en)
dc.subject: IDEAL (en)
dc.subject: Python (en)
dc.subject: natural language processing (en)
dc.subject: tree-edit distance (en)
dc.title: Developing an improved focused crawler for the IDEAL project (en)
dc.type: Presentation (en)
dc.type: Software (en)
dc.type: Technical report (en)
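
The abstract describes comparing a tree built from a webpage against an event tree built from the user's input by means of tree edit distance. The following is a minimal sketch of that idea, not the project's actual code: it assumes ordered, labeled trees, unit costs for insert, delete, and relabel, and a simple memoized forest recursion. The `tree` helper, the node labels, and the example events are all hypothetical and chosen only for illustration.

```python
from functools import lru_cache

def tree(label, *children):
    """Build an immutable (hashable) tree node: (label, (child, ...))."""
    return (label, tuple(children))

def size(node):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in node[1])

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Edit distance between two ordered forests (tuples of trees), unit costs."""
    if not f1:
        return sum(size(t) for t in f2)   # insert every remaining node
    if not f2:
        return sum(size(t) for t in f1)   # delete every remaining node
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children join the forest.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the rightmost root of f2; its children join the forest.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_distance(kids1, kids2)
             + forest_distance(f1[:-1], f2[:-1])
             + (label1 != label2))
    return min(delete, insert, match)

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

# Hypothetical event tree from user input: type, location, date.
event = tree("event",
             tree("type:hurricane"),
             tree("location:new york"),
             tree("date:2012-10"))

# Hypothetical trees extracted from two crawled pages.
close_page = tree("event",
                  tree("type:hurricane"),
                  tree("location:new york"),
                  tree("date:2012-11"))
far_page = tree("event",
                tree("type:earthquake"),
                tree("location:tokyo"))

for name, page in [("close_page", close_page), ("far_page", far_page)]:
    print(name, tree_edit_distance(event, page))
```

Under these assumptions, a smaller distance suggests that a page describes an event closer to the one the user specified. How the project turns the distance into a relevance decision (for example, a threshold) is not stated in this record.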

Files

Original bundle
Name:
6604S14IDEALfocusedCrawling - Final Presentation.pptx
Size:
1.81 MB
Format:
Microsoft Powerpoint XML
Description:
Final Presentation (PowerPoint)
Name:
6604S14IDEALfocusedCrawling - Midterm Presentation.pdf
Size:
127.31 KB
Format:
Adobe Portable Document Format
Description:
Midterm Presentation (pdf)
Name:
focused-crawler-master.zip
Size:
87.57 MB
Format:
Unknown data format
Description:
Zip containing all relevant front-end and back-end code necessary to run both versions of the web crawler.
Name:
6604S14IDEALfocusedCrawling - Midterm Presentation.pptx
Size:
166.2 KB
Format:
Microsoft Powerpoint XML
Description:
Midterm Presentation (PowerPoint)
Name:
6604S14IDEALfocusedCrawling - Final Presentation.pdf
Size:
622.84 KB
Format:
Adobe Portable Document Format
Description:
Final Presentation (pdf)
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission