Using Transactional Web Archives To Handle Server Errors

We describe a web archiving application that handles server errors using the most recently archived representation of the requested web resource. The application is developed as an Apache module. It leverages the transactional web archiving tool SiteStory, which archives all previously accessed representations of web resources originating from a website. This application helps to improve the website's quality of service by temporarily masking server errors from the end user and gaining precious time for the system administrator to debug and recover from server failures. By providing pertinent support to website operations, we aim to reduce the resistance to transactional web archiving, which in turn may lead to a better coverage of web history.


INTRODUCTION
By some estimates [1] [2], existing web archives barely scratch the surface of the web's total history. The low coverage may partially be attributed to the crawler-based archiving approach that these archives predominantly use. A web crawler can only archive the content it actively fetches during scheduled crawls. However, changes to web resources are inherently unpredictable, which makes it extremely difficult to align the crawling schedule with them. Observing the politeness policy further limits the crawler's ability to track changes.
A more comprehensive web history may be collected by involving more stakeholders through transactional web archiving [3]. A transaction is initiated by the user of a website. The archive sits between the user and the origin server and passively collects and archives the responses used to fulfill the user's requests. Taken together, the responses to all these requests closely approximate a website's full history, or at least its memorable portion.
Despite its advantages, archiving web transactions requires cooperation from the website owner. Only the origin server has information about all requests and responses; therefore the archive needs to be part of it. But it is not easy to engage website operations staff and convince them of the archive's value. Typical IT operations are preoccupied with their immediate needs and pay little attention to services whose benefits are longer term. It is therefore crucial for us to expand the value proposition of web archiving beyond the pledged altruistic cause and make transactional web archives immediately useful to day-to-day IT operations.
In this paper we present a web archiving application intended to improve website uptime, a core quality of service indicator for web operations. It takes the most recently archived representation of a web resource to handle application server failures. Webmasters benefit from this application because they gain precious time to recover from application server failures without disrupting the majority of their users' web experience. Archivists also benefit from it because the fine archival granularity resulting from transactional archiving is impossible to attain otherwise.

ARCHITECTURE
Figure 1 illustrates the architecture of this archiving application. It assumes the typical 3-tier web application made up of 1) a frontend server, assumed to be Apache, 2) an application server, and 3) an optional database server.
The system includes SiteStory [4], a transactional web archiving tool developed by Los Alamos National Laboratory. SiteStory has two components: mod_sitestory, an Apache module installed and configured as part of the Apache frontend server, and SiteStory Web Archive, a Java application that runs in a Tomcat container and uses Berkeley DB to store the archived web content.
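The division of labor between a frontend hook and an archive store can be illustrated with a small model. This is a minimal sketch in Python, not SiteStory's implementation: the names `TransactionalArchive` and `archiving_handler` are hypothetical, storage is an in-memory dictionary rather than Berkeley DB, and skipping unchanged bodies is an assumption of the sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (status code, body) is a deliberately simplified stand-in for an HTTP response.
Response = Tuple[int, bytes]


class TransactionalArchive:
    """Hypothetical in-memory archive mapping a URI to its timestamped versions."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Tuple[float, bytes]]] = {}

    def record(self, uri: str, body: bytes, when: float) -> None:
        versions = self._versions.setdefault(uri, [])
        # Assumption of this sketch: keep a new version only when the body changed.
        if not versions or versions[-1][1] != body:
            versions.append((when, body))

    def latest(self, uri: str) -> Optional[Tuple[float, bytes]]:
        versions = self._versions.get(uri)
        return versions[-1] if versions else None

    def history(self, uri: str) -> List[Tuple[float, bytes]]:
        return list(self._versions.get(uri, []))


def archiving_handler(
    origin: Callable[[str], Response],
    archive: TransactionalArchive,
    clock: Callable[[], float],
) -> Callable[[str], Response]:
    """Wrap an origin handler so that every 200 response is also archived."""

    def handle(uri: str) -> Response:
        status, body = origin(uri)
        if status == 200:
            archive.record(uri, body, clock())
        # The user always receives the origin's response unchanged.
        return status, body

    return handle
```

In this model the wrapper stands in for the frontend module and `TransactionalArchive` for the archive application; the archive only ever sees what users actually requested.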
The application developed in this project, mod_uws, is similar to mod_sitestory in that it is also an Apache module. It handles web disruptions that generate HTTP 5xx error codes. These errors usually occur behind the frontend, and result from application server failure, internal network congestion and disruption, and database server bottlenecks and failures. These problems are not uncommon, and are of great concern to webmasters. When these errors occur, we assume the frontend server is still alive and working properly, and is therefore able to generate HTTP 5xx codes. This assumption is realistic because the commonly used gateway servers, e.g., Apache, are designed to handle high workloads and have well-designed scaling capabilities. They have been battle-hardened, and usually are more mature and robust than the other components in a web deployment. In Figure 1 the solid line denotes the transactional archiving workflow under normal working conditions, while the dotted line represents the error-handling mode, in which an HTTP 5xx response triggers our application.
As explained in [4], under normal working conditions all HTTP 200 responses are sent in parallel to both the website user and the SiteStory web archive. The archive therefore always contains the most recent server state up to the point at which an error occurs. At that point an HTTP 5xx response would normally be sent back to the requester through the Apache frontend server. Instead, mod_uws detects the error, becomes active, and intercepts the 5xx response. It then sends a Memento request [5] to the SiteStory archive to retrieve the most recently archived copy of the requested URI. This copy is sent back to the requester with the appropriate HTTP headers modified, hence masking the server error from end users. Although this is not currently implemented, once mod_uws is activated it could in theory adaptively manage subsequent requests using algorithms such as exponential backoff. This would help flatten any potential peak load and allow the application server to recover from bottlenecks.
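The error-handling path described above can be sketched as follows. This is an illustrative Python model, not the mod_uws source, which is an Apache module. The `Response` type, the `TIMEGATE_BASE` URL, the injected `fetch_archived` callable, and the `X-Served-From-Archive` header are all hypothetical. The text does not say which headers mod_uws modifies, so the choice here is only a placeholder. `Accept-Datetime` and `Memento-Datetime` are the headers defined by the Memento protocol.

```python
from dataclasses import dataclass, field
from email.utils import formatdate
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Response:
    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


# Hypothetical TimeGate location; a real deployment would configure its own.
TIMEGATE_BASE = "http://archive.example.org/timegate/"

# Assumed to perform the HTTP request, follow the TimeGate redirect,
# and return the archived response (or None if nothing is available).
Fetcher = Callable[[str, Dict[str, str]], Optional[Response]]


def build_memento_request(
    uri: str, now: Optional[float] = None
) -> Tuple[str, Dict[str, str]]:
    """Ask for the copy closest to the current time, i.e. the most recent one."""
    headers = {"Accept-Datetime": formatdate(now, usegmt=True)}
    return TIMEGATE_BASE + uri, headers


def mask_server_error(
    uri: str,
    origin_response: Response,
    fetch_archived: Fetcher,
    now: Optional[float] = None,
) -> Response:
    # Normal operation: anything other than a 5xx passes through untouched.
    if not 500 <= origin_response.status <= 599:
        return origin_response

    url, headers = build_memento_request(uri, now)
    archived = fetch_archived(url, headers)

    # Nothing usable in the archive: the original error has to be returned.
    if archived is None or archived.status != 200:
        return origin_response

    out_headers = dict(archived.headers)
    # Placeholder header modification (an assumption of this sketch).
    out_headers["X-Served-From-Archive"] = "1"
    return Response(200, out_headers, archived.body)
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular HTTP client and makes the fallback decision easy to exercise without a running archive.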
We developed mod_uws at the frontend server level in order to make it agnostic to the programming languages, development tools, web frameworks, and database products used to build the website. This ensures the broadest possible adoption base. Any web operation using the Apache HTTP server can easily integrate and benefit from this application without significantly modifying its deployment and configuration.

FUTURE WORK
It has been shown that using SiteStory does not significantly affect the performance of the Apache HTTP server [6]. We will soon publish performance test results showing how the current implementation of mod_uws affects Apache. We will also attempt to move the Memento request and response function into an external application in the hope of lessening the performance impact of mod_uws on Apache.
Larger web operations usually deploy a load balancer in front of many Apache servers. This provides us with the opportunity to move both SiteStory and the error handling application one level up to the load balancer to further improve the archiving efficiency and the ease of deployment.