Checkpointing is a technique that provides
fault tolerance
Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for high-availability, mission-critical, or even life-critical systems.
Fault t ...
for
computing
Computing is any goal-oriented activity requiring, benefiting from, or creating computer, computing machinery. It includes the study and experimentation of algorithmic processes, and the development of both computer hardware, hardware and softw ...
systems. It involves saving a
snapshot of an
application's state, so that it can restart from that point in case of
failure
Failure is the social concept of not meeting a desirable or intended objective, and is usually viewed as the opposite of success. The criteria for failure depends on context, and may be relative to a particular observer or belief system. On ...
. This is particularly important for long-running applications that are executed in failure-prone computing systems.
Checkpointing in distributed systems
In the
distributed computing
Distributed computing is a field of computer science that studies distributed systems, defined as computer systems whose inter-communicating components are located on different networked computers.
The components of a distributed system commu ...
environment, checkpointing is a technique that helps tolerate failures that would otherwise force a long-running application to restart from the beginning. The most basic way to implement checkpointing is to stop the application, copy all the required data from the memory to reliable storage (e.g.,
parallel file system), then continue with execution. In the case of failure, when the application restarts, it does not need to start from scratch. Rather, it will read the latest state ("the checkpoint") from the stable storage and execute from that point. While there is ongoing debate on whether checkpointing is the dominant I/O workload on distributed computing systems, the general consensus is that checkpointing is one of the major I/O workloads.
There are two main approaches for checkpointing in the distributed computing systems: coordinated checkpointing and uncoordinated checkpointing. In the coordinated checkpointing approach, processes must ensure that their checkpoints are consistent. This is usually achieved by some kind of
two-phase commit protocol
In transaction processing, databases, and computer networking, the two-phase commit protocol (2PC, ''tupac'') is a type of Atomic commit, atomic commitment protocol (ACP). It is a distributed algorithm that coordinates all the processes that pa ...
algorithm. In the uncoordinated checkpointing, each process checkpoints its own state independently. It must be stressed that simply forcing processes to checkpoint their state at fixed time intervals is not sufficient to ensure global consistency. The need for establishing a consistent state (i.e., no missing messages or duplicated messages) may force other processes to roll back to their checkpoints, which in turn may cause other processes to roll back to even earlier checkpoints, which in the most extreme case may mean that the only consistent state found is the initial state (the so-called ''
domino effect
A domino effect is the cumulative effect produced when one event sets off a series of similar or related events, a form of chain reaction. The term is an analogy to a falling row of dominoes. It typically refers to a linked sequence of events ...
'').
Implementations for applications
Save State
One of the original and now most common means of application checkpointing was a "save state" feature in interactive applications, in which the user of the application could save the state of all variables and other data and either continue working or exit the application and restart the application and restore the saved state at a later time. This was implemented through a "save" command or menu option in the application. In many cases, it became standard practice to ask the user, if they had unsaved work when exiting an application, if they wanted to save their work before doing so.
This functionality became extremely important for usability in applications in which a particular task could not be completed in one sitting (such as playing a video game expected to take dozens of hours) or in which the work was being done over a long period of time (such as data entry into a document such as rows in a spreadsheet).
The problem with save state is it requires the operator of a program to request the save. For non-interactive programs, including automated or batch processed workloads, the ability to checkpoint such applications also had to be automated.
Checkpoint/Restart
As batch applications began to handle tens to hundreds of thousands of transactions, where each transaction might process one record from one file against several different files, the need for the application to be restartable at some point without the need to rerun the entire job from scratch became imperative. Thus the "checkpoint/restart" capability was born, in which after a number of transactions had been processed, a "snapshot" or "checkpoint" of the state of the application could be taken. If the application failed before the next checkpoint, it could be restarted by giving it the checkpoint information and the last place in the transaction file where a transaction had successfully completed. The application could then restart at that point.
Checkpointing tends to be expensive, so it was generally not done with every record, but at some reasonable compromise between the cost of a checkpoint vs. the value of the computer time needed to reprocess a batch of records. Thus the number of records processed for each checkpoint might range from 25 to 200, depending on cost factors, the relative complexity of the application and the resources needed to successfully restart the application.
Fault Tolerance Interface (FTI)
FTI is a library that aims to provide computational scientists with an easy way to perform checkpoint/restart in a scalable fashion. FTI leverages local storage plus multiple replications and erasures techniques to provide several levels of reliability and performance. FTI provides application-level checkpointing that allows users to select which data needs to be protected, in order to improve efficiency and avoid space, time and energy waste. It offers a direct data interface so that users do not need to deal with files and/or directory names. All metadata is managed by FTI in a transparent fashion for the user. If desired, users can dedicate one process per node to overlap fault tolerance workload and scientific computation, so that post-checkpoint tasks are executed asynchronously.
Berkeley Lab Checkpoint/Restart (BLCR)
The Future Technologies Group at the Lawrence National Laboratories are developing a hybrid kernel/user implementation of checkpoint/restart called BLCR. Their goal is to provide a robust, production quality implementation that checkpoints a wide range of applications, without requiring changes to be made to application code. BLCR focuses on checkpointing parallel applications that communicate through MPI, and on compatibility with the software suite produced by the SciDAC Scalable Systems Software ISIC. Its work is broken down into 4 main areas: Checkpoint/Restart for Linux (CR), Checkpointable MPI Libraries, Resource Management Interface to Checkpoint/Restart and Development of Process Management Interfaces.
DMTCP
DMTCP (Distributed MultiThreaded Checkpointing) is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets. It does not modify the user's program or the operating system. Among the applications supported by DMTCP are
Open MPI
Open MPI is a Message Passing Interface (MPI) library project combining technologies and resources from several other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI). It is used by many TOP500 supercomputers including Roadrunner, which was th ...
,
Python,
Perl
Perl is a high-level, general-purpose, interpreted, dynamic programming language. Though Perl is not officially an acronym, there are various backronyms in use, including "Practical Extraction and Reporting Language".
Perl was developed ...
, and many
programming language
A programming language is a system of notation for writing computer programs.
Programming languages are described in terms of their Syntax (programming languages), syntax (form) and semantics (computer science), semantics (meaning), usually def ...
s and shell scripting languages. With the use of TightVNC, it can also checkpoint and restart X Window applications, as long as they do not use extensions (e.g. no OpenGL or video). Among the Linux features supported by DMTCP are open
file descriptor
In Unix and Unix-like computer operating systems, a file descriptor (FD, less frequently fildes) is a process-unique identifier (handle) for a file or other input/output resource, such as a pipe or network socket.
File descriptors typically h ...
s, pipes, sockets, signal handlers, process id and thread id virtualization (ensure old pids and tids continue to work upon restart), ptys, fifos, process group ids, session ids, terminal attributes, and
mmap
In computing, mmap(2) is a POSIX-compliant Unix system call that maps files or devices into memory. It is a method of memory-mapped file I/O. It implements demand paging because file contents are not immediately read from disk and initially use n ...
/mprotect (including mmap-based shared memory). DMTCP supports the OFED API for InfiniBand on an experimental basis.
Collaborative checkpointing
Some recent protocols perform collaborative checkpointing by storing fragments of the checkpoint in nearby nodes. This is helpful because it avoids the cost of storing to a parallel file system (which often becomes a bottleneck for large-scale systems) and it uses storage that is closer. This has found use particularly in large-scale supercomputing clusters. The challenge is to ensure that when the checkpoint is needed when recovering from a failure, the nearby nodes with fragments of the checkpoints are available.
Docker
Docker and the underlying technology contain a checkpoint and restore mechanism.
CRIU
CRIU
Checkpoint/Restore In Userspace (CRIU) (pronounced ''kree-oo'', ), is a software tool for the Linux operating system. Using this tool, it is possible to freeze a running application (or part of it) and checkpoint it to persistent storage as a c ...
is a user space checkpoint library.
Implementation for embedded and ASIC devices
Mementos
Mementos is a software system that transforms general-purpose tasks into interruptible programs for platforms with frequent interruptions such as power outages. It was designed for batteryless embedded devices such as
RFID tag
Radio-frequency identification (RFID) uses electromagnetic fields to automatically identify and track tags attached to objects. An RFID system consists of a tiny radio transponder called a tag, a radio receiver, and a transmitter. When trig ...
s and smart cards which rely on
harvesting energy from ambient background sources. Mementos frequently senses the available energy in the system and decides whether to checkpoint the program due to impending power loss versus continuing computation. If checkpointing, data will be stored in a
non-volatile memory
Non-volatile memory (NVM) or non-volatile storage is a type of computer memory that can retain stored information even after power is removed. In contrast, volatile memory needs constant power in order to retain data.
Non-volatile memory typ ...
. When the energy becomes sufficient for
reboot
In computing, rebooting is the process by which a running computer system is restarted, either intentionally or unintentionally. Reboots can be either a cold reboot (alternatively known as a hard reboot) in which the power to the system is physi ...
, the data is retrieved from non-volatile memory and the program continues from the stored state. Mementos has been implemented on the
MSP430 family of
microcontrollers
A microcontroller (MC, uC, or μC) or microcontroller unit (MCU) is a small computer on a single integrated circuit. A microcontroller contains one or more CPUs (processor cores) along with memory and programmable input/output peripherals. Pro ...
. Mementos is named after
Christopher Nolan
Sir Christopher Edward Nolan (born 30 July 1970) is a British and American filmmaker. Known for his Cinema of the United States, Hollywood Blockbuster (entertainment), blockbusters with complex storytelling, he is considered a leading filmma ...
's
''Memento''.
Idetic
Idetic is a set of automatic tools which helps
application-specific integrated circuit
An application-specific integrated circuit (ASIC ) is an integrated circuit (IC) chip customized for a particular use, rather than intended for general-purpose use, such as a chip designed to run in a digital voice recorder or a high-efficienc ...
(ASIC) developers automatically embed checkpoints in their designs. It targets
high-level synthesis
High-level synthesis (HLS), sometimes referred to as C synthesis, electronic system-level (ESL) synthesis, algorithmic synthesis, or behavioral synthesis, is an automated design process that takes an abstract behavioral specification of a digital ...
tools and adds the checkpoints at the
register-transfer level
In digital circuit design, register-transfer level (RTL) is a design abstraction which models a synchronous digital circuit in terms of the flow of digital signals (data) between hardware registers, and the logical operations performed on th ...
(
Verilog
Verilog, standardized as IEEE 1364, is a hardware description language (HDL) used to model electronic systems. It is most commonly used in the design and verification of digital circuits, with the highest level of abstraction being at the re ...
code). It uses a
dynamic programming approach to locate low overhead points in the
state machine
A finite-state machine (FSM) or finite-state automaton (FSA, plural: ''automata''), finite automaton, or simply a state machine, is a mathematical model of computation. It is an abstract machine that can be in exactly one of a finite number o ...
of the design. Since the checkpointing in hardware level involves sending the data of dependent
register
Register or registration may refer to:
Arts, entertainment, and media
Music
* Register (music), the relative "height" or range of a note, melody, part, instrument, etc.
* ''Register'', a 2017 album by Travis Miller
* Registration (organ), ...
s to a non-volatile memory, the optimum points are required to have minimum number of registers to store. Idetic is deployed and evaluated on energy harvesting
RFID tag
Radio-frequency identification (RFID) uses electromagnetic fields to automatically identify and track tags attached to objects. An RFID system consists of a tiny radio transponder called a tag, a radio receiver, and a transmitter. When trig ...
device.
[Mirhoseini, A.; Songhori, E.M.; Koushanfar, F., "Idetic: A high-level synthesis approach for enabling long computations on transiently-powered ASICs," Pervasive Computing and Communications (PerCom), 2013 IEEE International Conference on , vol., no., pp.216,224, 18–22 March 2013 URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6526735&isnumber=6526701]
See also
*
Process image
*
Save state
A saved game (also called a game save, savegame, savefile, save point, or simply save) is a piece of digitally stored information about the progress of a player in a video game.
From the earliest games in the 1970s onward, game platform hardw ...
s, a similar concept provided by
video game console emulator
A video game console emulator is a type of emulator that allows a computing device to emulate a video game console's hardware and play its games on the emulating platform. More often than not, emulators carry additional features that surpass ...
s
References
Further reading
* Yibei Ling, Jie Mi, Xiaola Lin: A Variational Calculus Approach to Optimal Checkpoint Placement. IEEE Trans. Computers 50(7): 699-708 (2001)
* R.E. Ahmed, R.C. Frazier, and P.N. Marinos, " Cache-Aided Rollback Error Recovery (CARER) Algorithms for Shared-Memory Multiprocessor Systems", IEEE 20th International Symposium on Fault-Tolerant Computing (FTCS-20), Newcastle upon Tyne, UK, June 26–28, 1990, pp. 82–88.
External links
LibCkptFTIBerkeley Lab Checkpoint/Restart (BLCR)Distributed MultiThreaded CheckPointing (DMTCP)OpenVZCRIUCryopid2
{{Parallel computing
Fault-tolerant computer systems