HOME

TheInfoList



OR:

Lockstep systems are
fault-tolerant computer system Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for high-availability, mission critical, mission-critical, or even life-critical sys ...
s that run the same set of operations at the same time in
parallel Parallel may refer to: Mathematics * Parallel (geometry), two lines in the Euclidean plane which never intersect * Parallel (operator), mathematical operation named after the composition of electrical resistance in parallel circuits Science a ...
. The redundancy (duplication) allows error detection and error correction: the output from lockstep operations can be compared to determine if there has been a fault if there are at least two systems (
dual modular redundancy In reliability engineering, dual modular redundancy (DMR) is when components of a system are duplicated, providing redundancy in case one should fail. It is particularly applied to systems where the duplicated components work in parallel, particu ...
DMR), and the error can be automatically corrected if there are at least three systems (
triple modular redundancy In computing, triple modular redundancy, sometimes called triple-mode redundancy, (TMR) is a fault-tolerant form of N-modular redundancy, in which three systems perform a process and that result is processed by a majority-voting system to produc ...
TMR), via majority vote. The term " lockstep" originates from army usage, where it refers to synchronized walking, in which marchers walk as closely together as physically practical. To run in lockstep, each system is set up to progress from one well-defined state to the next well-defined state. When a new set of inputs reaches the system, it processes them, generates new outputs and updates its state. This set of changes (new inputs, new outputs, new state) is considered to define that step, and must be treated as an atomic transaction; in other words, either all of it happens, or none of it happens, but not something in between. Sometimes a timeshift (delay) is set between systems, which increases the detection probability of errors induced by external influences (e.g.
voltage spike In electrical engineering, spikes are fast, short duration electrical transients in voltage (voltage spikes), current (current spikes), or transferred energy (energy spikes) in an electrical circuit. Fast, short duration electrical transients ...
s,
ionizing radiation Ionizing (ionising) radiation, including Radioactive decay, nuclear radiation, consists of subatomic particles or electromagnetic waves that have enough energy per individual photon or particle to ionization, ionize atoms or molecules by detaching ...
, or
in situ is a Latin phrase meaning 'in place' or 'on site', derived from ' ('in') and ' ( ablative of ''situs'', ). The term typically refers to the examination or occurrence of a process within its original context, without relocation. The term is use ...
reverse engineering Reverse engineering (also known as backwards engineering or back engineering) is a process or method through which one attempts to understand through deductive reasoning how a previously made device, process, system, or piece of software accompl ...
).


Lockstep memory

Some vendors, including Intel, use the term ''lockstep memory'' to describe a multi-channel memory layout in which
cache line A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which ...
s are distributed between two memory channels, so one half of the cache line is stored in a
DIMM A DIMM (Dual In-line Memory Module) is a popular type of memory module used in computers. It is a printed circuit board with one or both sides (front and back) holding DRAM chips and pins. The vast majority of DIMMs are manufactured in compl ...
on the first channel, while the second half goes to a DIMM on the second channel. By combining the
single error correction and double error detection In computer science and telecommunications, Hamming codes are a family of linear error-correcting codes. Hamming codes can detect one-bit and two-bit errors, or correct one-bit errors without detection of uncorrected errors. By contrast, the ...
(SECDED) capabilities of two ECC-enabled DIMMs in a lockstep layout, their ''single-device data correction'' (SDDC) nature can be extended into ''double-device data correction'' (DDDC), providing protection against the failure of any single memory chip. Downsides of the Intel's lockstep memory layout are the reduction of effectively usable amount of RAM (in case of a triple-channel memory layout, maximum amount of memory reduces to one third of the physically available maximum), and reduced performance of the memory subsystem.


Dual modular redundancy

Where the computing systems are duplicated, but both actively process each step, it is difficult to arbitrate between them if their outputs differ at the end of a step. For this reason, it is common practice to run DMR systems as "master/slave" configurations with the slave as a "hot-standby" to the master, rather than in lockstep. Since there is no advantage in having the slave unit actively process each step, a common method of working is for the master to copy its state at the end of each step's processing to the slave. Should the master fail at some point, the slave is ready to continue from the previous known good step. While either the lockstep or the DMR approach (when combined with some means of detecting errors in the master) can provide redundancy against hardware failure in the master, they do not protect against software error. If the master fails because of a software error, it is highly likely that the slave - in attempting to repeat the execution of the step which failed - will simply repeat the same error and fail in the same way, an example of a
common mode failure Common and special causes are the two distinct origins of variation in a process, as defined in the statistical thinking and methods of Walter A. Shewhart and W. Edwards Deming. Briefly, "common causes", also called natural patterns, are the ...
.


Triple modular redundancy

Where the computing systems are triplicated, it becomes possible to treat them as "voting" systems. If one unit's output disagrees with the other two, it is detected as having failed. The matched output from the other two is treated as correct.


See also

*
Master-checker Master-checker or master/checker is a hardware-supported fault tolerance architecture for multiprocessor Multiprocessing (MP) is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to th ...
*
NonStop (server computers) NonStop is a series of server computers introduced to market in 1976 by Tandem Computers Inc., beginning with the NonStop product line. It was followed by the Tandem Integrity NonStop line of lock-step fault-tolerant computers, now defunct (n ...
*
Stratus VOS Stratus VOS (Virtual Operating System) is a proprietary operating system running on Stratus Technologies fault-tolerant computer systems. VOS is available on Stratus's ftServer and Continuum platforms. VOS customers use it to support high-volu ...
*
VAXft The VAXft was a family of fault-tolerant minicomputers developed and manufactured by Digital Equipment Corporation (DEC) using processors implementing the VAX instruction set architecture (ISA). "VAXft" stood for "Virtual Address Extension, fault t ...


References

{{Reflist, 30em


External links


Enabling Memory Reliability, Availability, and Serviceability Features on Dell PowerEdge Servers
2005
Chipkill correct memory architecture
August 2000, by David Locklear Classes of computers Fault-tolerant computer systems