Toward an Intelligent Crawling Scheduler for Archiving News Websites Using Reinforcement Learning

dc.contributor.author | Wang, Xinyue | en
dc.contributor.author | Ahuja, Naman | en
dc.contributor.author | Llorens, Nathaniel | en
dc.contributor.author | Bansal, Ritesh | en
dc.contributor.author | Dhar, Siddharth | en
dc.date.accessioned | 2020-01-17T03:41:57Z | en
dc.date.available | 2020-01-17T03:41:57Z | en
dc.date.issued | 2019-12-03 | en
dc.description.abstract | Web crawling is one of the fundamental activities for many kinds of web technology organizations and companies, such as the Internet Archive and Google. While companies like Google often focus on content delivery for users, web archiving organizations such as the Internet Archive pay more attention to the accurate preservation of the web. Crawling accuracy and efficiency are major concerns in this task. An ideal crawling module should keep up with changes in the target web site while crawling as infrequently as possible, so that routine crawling is as efficient as possible. In this project, we investigate using information from web archives' history to help the crawling process within the scope of news websites. We aim to build a smart crawling module that can accurately predict web content change at both the web page level and the web site structure level, using modern machine learning algorithms and deep learning architectures. At the end of the project, we have collected and processed raw web archive collections from Archive.org and from our own frequent crawling jobs. We have developed methods to extract identical copies of web page content and web site structure from the web archive data. We have implemented baseline models for predicting web page content change and web site structure change, the former with supervised machine learning algorithms. We have also implemented two different reinforcement learning models for generating a web page crawling plan: a continuous prediction model and a sparse prediction model. Our results show that the reinforcement learning models have the potential to work as an intelligent web crawling scheduler. | en
dc.description.notes | Items: Archive_team_final_report.pdf: PDF version of the final report; Archive_team_final_presentation.pdf: PDF version of the final presentation; Archive_team_final_presentation.pptx: PPTX version of the final presentation; Report_ 6604-WebArchive_Overleaf_zip.zip: a zip of Overleaf LaTeX files for the final report | en
dc.description.sponsorship | NSF IIS-1619028 | en
dc.identifier.uri | http://hdl.handle.net/10919/96482 | en
dc.language.iso | en_US | en
dc.publisher | Virginia Tech | en
dc.rights | In Copyright | en
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en
dc.subject | digital library | en
dc.subject | web crawling | en
dc.subject | reinforcement learning | en
dc.subject | Machine learning | en
dc.title | Toward an Intelligent Crawling Scheduler for Archiving News Websites Using Reinforcement Learning | en
dc.type | Presentation | en
dc.type | Report | en

Files

Original bundle
Name:
Archive_team_final_presentation.pdf
Size:
1.27 MB
Format:
Adobe Portable Document Format
Name:
Archive_team_final_presentation.pptx
Size:
1.72 MB
Format:
Microsoft Powerpoint XML
Name:
Report_ 6604-WebArchive_final.zip
Size:
1.27 MB
Format:
Name:
Report__6604_WebArchive_final.pdf
Size:
1.19 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission
Description: