Toward an Intelligent Crawling Scheduler for Archiving News Websites Using Reinforcement Learning

Wang, Xinyue; Ahuja, Naman; Llorens, Nathaniel; Bansal, Ritesh; Dhar, Siddharth

dc.contributor.author	Wang, Xinyue	en
dc.contributor.author	Ahuja, Naman	en
dc.contributor.author	Llorens, Nathaniel	en
dc.contributor.author	Bansal, Ritesh	en
dc.contributor.author	Dhar, Siddharth	en
dc.date.accessioned	2020-01-17T03:41:57Z	en
dc.date.available	2020-01-17T03:41:57Z	en
dc.date.issued	2019-12-03	en
dc.description.abstract	Web crawling is a fundamental activity for many web technology organizations and companies, such as the Internet Archive and Google. While companies like Google often focus on content delivery for users, web archiving organizations such as the Internet Archive pay more attention to the accurate preservation of the web. Crawling accuracy and efficiency are major concerns in this task. An ideal crawling module should keep up with changes in the target web site while crawling as infrequently as possible, maximizing the efficiency of routine crawls. In this project, we investigate using information from the history of web archives to help the crawling process within the scope of news websites. We aim to build a smart crawling module that can accurately predict web content change at both the web page level and the web site structure level through modern machine learning algorithms and deep learning architectures. By the end of the project, we have collected and processed raw web archive collections from Archive.org and from our own frequent crawling jobs. We have developed methods to extract identical copies of web page content and web site structure from the web archive data. We have implemented baseline models for predicting web page content change and web site structure change with supervised machine learning algorithms. We have implemented two different reinforcement learning models for generating a web page crawling plan: a continuous prediction model and a sparse prediction model. Our results show that the reinforcement learning model has the potential to work as an intelligent web crawling scheduler.	en
dc.description.notes	Items: Archive_team_final_report.pdf: PDF version of the final report; Archive_team_final_presentation.pdf: PDF version of the final presentation; Archive_team_final_presentation.pptx: PPTX version of the final presentation; Report_ 6604-WebArchive_Overleaf_zip.zip: a zip of the Overleaf LaTeX files for the final report	en
dc.description.sponsorship	NSF IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/96482	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	digital library	en
dc.subject	web crawling	en
dc.subject	reinforcement learning	en
dc.subject	Machine learning	en
dc.title	Toward an Intelligent Crawling Scheduler for Archiving News Websites Using Reinforcement Learning	en
dc.type	Presentation	en
dc.type	Report	en