Crisis Events Webpages Archiving
Abstract
Webpages disappear from the web quickly. When a crisis event occurs, it is important to retrieve and preserve the webpages related to that event before they disappear, so that their digital history is recorded. Web archiving is a technology that stores webpages in a format called WARC (Web ARChive). WARC records save the information required to replay a webpage as it appeared online, such as the HTML data, files, and ads. The goal of this project is to implement a web archiving system that can archive a large number of webpages based on user input.
To accomplish this, we implemented two main Python scripts, chosen for the language's scripting capabilities, and a user interface that adds recording and replaying functionality to our system. The first script goes through user-given websites and archives them in the WARC format, with one WARC file per website. It accepts a URL and the name of the collection in which to store the archived URL, and relies on Python libraries such as pywb and subprocess. It reads URLs from a given text file, uses multithreading for faster archival, and waits between URLs so that it does not overload a server with requests and get blocked by the target website(s). The second script replays WARC files, showing the archived webpage(s) as they appeared online at the time of archival. It accepts a WARC file (.warc or .warc.gz) supplied by the user and displays it in the user's own web browser using the webbrowser library. Its implementation uses similar Python libraries.
The user interface was created using Node.js and React and is based on pywb's WebUI, giving the user an easier way to use the aforementioned scripts.
A Flask script is then used to link the UI and the scripts together, making the functions the scripts offer easier to use. Using the UI, a user can search for an archived page by its URL and is presented with a pywb calendar to view the website capture(s) in their local browser. The final deliverables of this project include the completed scripts, a user interface, a set of presentation slides, and a final report submitted to our professor and client, Mohamed Farag. The report and presentations show the progress our team made throughout the stages of our project's development.
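The archiving workflow described above (read URLs from a text file, archive them on several threads, and pause between requests to the same server) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `read_urls`, `HostThrottle`, and `archive_all` are hypothetical, and the archiving step itself is passed in as a callable rather than tied to any particular pywb command.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlsplit


def read_urls(lines: Iterable[str]) -> List[str]:
    """Return stripped, non-empty lines, skipping '#' comment lines."""
    stripped = (line.strip() for line in lines)
    return [s for s in stripped if s and not s.startswith("#")]


class HostThrottle:
    """Hypothetical helper: space out requests that target the same host."""

    def __init__(self, delay_seconds: float) -> None:
        self.delay = delay_seconds
        self._lock = threading.Lock()
        self._next_allowed: Dict[str, float] = {}

    def wait(self, url: str) -> None:
        host = urlsplit(url).netloc
        with self._lock:
            now = time.monotonic()
            # Reserve the next free time slot for this host.
            start = max(now, self._next_allowed.get(host, now))
            self._next_allowed[host] = start + self.delay
        # Sleep outside the lock so other hosts are not held up.
        time.sleep(max(0.0, start - now))


def archive_all(
    urls: Iterable[str],
    archive_one: Callable[[str], None],
    delay_seconds: float = 1.0,
    workers: int = 4,
) -> Dict[str, bool]:
    """Archive each URL on a thread pool; map URL -> success flag."""
    url_list = list(urls)
    throttle = HostThrottle(delay_seconds)

    def task(url: str) -> bool:
        throttle.wait(url)
        try:
            archive_one(url)  # assumed to raise on failure
            return True
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(task, url_list))
    return dict(zip(url_list, results))
```

In a real run, `read_urls` would be given an open text file, and `archive_one` would wrap whatever recording step the system uses, for example a `subprocess` call to the recorder; that wiring is left out here because the abstract does not specify it.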