Metamori: A library for Incremental File Checkpointing

TR Number

Date

2004-05-11

Journal Title

Journal ISSN

Volume Title

Publisher

Virginia Tech

Abstract

The advent of cluster computing has resulted in a thrust towards providing software mechanisms for reliability on clusters. The prevalent model for such mechanisms is to take a snapshot of the state of an application, called a checkpoint and commit it to stable storage. This checkpoint has sufficient meta-data, so that if the application fails, it can be restarted from the checkpoint. This operation is called a restore. In order to record a process' complete state, both its volatile and persistent state must be checkpointed. Several libraries exist for checkpointing volatile state. Some of these libraries feature incremental checkpointing, where only the changes since the last checkpoint are recorded in the next checkpoint. Such incremental checkpointing is advantageous since otherwise, the time taken for each successive checkpoint becomes larger and larger. Also, when checkpointing is done in increments, we can restore state to any of the previous checkpoints; a vital feature for adaptive applications. This thesis presents a user-level incremental checkpointing library for files: Metamori. This brings the advantages of incremental memory checkpointing to files as well, thereby providing a low-overhead approach to checkpoint persistent state. Thus, the complete state of an application can now be incrementally checkpointed, as compared to earlier approaches where volatile state was checkpointed incrementally but persistent state had no such facilities.

Description

Keywords

Fault-tolerance, File Checkpointing, Checkpointing, Rollback-Recovery

Citation

Collections