In reliability engineering, dual modular redundancy (DMR) is the duplication of components of a system so that redundancy is available if one of them fails. It is applied particularly to systems in which the duplicated components work in parallel, notably fault-tolerant computer systems. A typical example is a complex computer system with duplicated nodes, so that if one node fails, another is ready to carry on its work.
DMR provides robustness to the failure of one component, and error detection when instruments or computers that should give the same result give different ones. It does not provide error correction, because it cannot be determined automatically ''which'' component is correct and which is malfunctioning. An old adage makes the point: "Never go to sea with two chronometers; take one or three." That is, if two chronometers disagree, a sailor may not know which one is reading correctly.
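The detect-but-not-correct behaviour can be illustrated with a minimal Python sketch. The names `dmr_read` and `DmrMismatch` are hypothetical and chosen only for this illustration; the two channels are modelled as plain callables that are supposed to return the same value.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class DmrMismatch(Exception):
    """Raised when the two redundant channels disagree (hypothetical name)."""


def dmr_read(channel_a: Callable[[], T], channel_b: Callable[[], T]) -> T:
    """Return the value both channels agree on, or raise if they differ."""
    a = channel_a()
    b = channel_b()
    if a != b:
        # A fault has been detected, but with only two readings there is
        # no way to tell which channel is the faulty one.
        raise DmrMismatch(f"channels disagree: {a!r} vs {b!r}")
    return a
```

When the channels agree, the caller receives the shared value. When they disagree, the caller learns only that something is wrong and must fall back on some other recovery method.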
A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed dual modular redundant (DMR). With only two replications, the voting circuit can only detect a mismatch, and recovery relies on other methods. Examples include the 1ESS switch.
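Lockstep operation with two replications might be modelled as follows. This is a simplified software sketch, not a description of any real machine: each replica is assumed to be a pure step function mapping `(state, input)` to `(new_state, output)`, and `run_lockstep`, `accumulate`, and `faulty_accumulate` are hypothetical names introduced for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

# Assumed model: a replica is a step function (state, input) -> (state, output).
Step = Callable[[int, int], Tuple[int, int]]


def run_lockstep(step_a: Step, step_b: Step, inputs: Iterable[int],
                 initial_state: int = 0) -> Optional[int]:
    """Feed both replicas the same inputs in lockstep.

    Returns the index of the first step whose outputs differ,
    or None if the replicas never diverge.
    """
    state_a = state_b = initial_state
    for i, x in enumerate(inputs):
        state_a, out_a = step_a(state_a, x)
        state_b, out_b = step_b(state_b, x)
        if out_a != out_b:
            # Mismatch detected; deciding which replica failed is out of scope.
            return i
    return None


def accumulate(state: int, x: int) -> Tuple[int, int]:
    """Healthy replica: keeps a running sum and outputs it."""
    new_state = state + x
    return new_state, new_state


def faulty_accumulate(state: int, x: int) -> Tuple[int, int]:
    """Replica with an injected fault: it silently ignores the input 7."""
    new_state = state if x == 7 else state + x
    return new_state, new_state
```

Running `run_lockstep(accumulate, faulty_accumulate, [1, 7, 3])` reports a divergence at step index 1, the step at which the injected fault first changes an output.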
A machine with three replications of each element is termed triple modular redundant (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.
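The voting and degradation behaviour can be sketched in Python. This is an illustrative software model under stated assumptions, not a hardware design: `majority_vote`, `NoMajority`, and `DegradingVoter` are hypothetical names, replica outputs are assumed to be hashable values, and an outvoted replica is simply dropped from all later votes.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


class NoMajority(Exception):
    """No value was produced by more than half of the replicas."""


def majority_vote(outputs: Sequence[Hashable]) -> Tuple[Hashable, List[int]]:
    """Return (majority value, positions of the dissenting outputs)."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        # E.g. two replicas that disagree: detectable, but not correctable.
        raise NoMajority(f"no strict majority among {list(outputs)!r}")
    return value, [i for i, out in enumerate(outputs) if out != value]


class DegradingVoter:
    """Voter that stops trusting a replica once it has been outvoted."""

    def __init__(self, n_replicas: int = 3) -> None:
        self.active = list(range(n_replicas))  # replica ids still trusted

    def vote(self, outputs: Sequence[Hashable]) -> Hashable:
        """`outputs` is indexed by replica id; inactive replicas are ignored."""
        live = [outputs[r] for r in self.active]
        value, dissenters = majority_vote(live)
        # Drop outvoted replicas: with three replicas this leaves a DMR pair.
        self.active = [r for k, r in enumerate(self.active)
                       if k not in dissenters]
        return value
```

With three replicas, `DegradingVoter().vote([5, 5, 9])` returns 5 and leaves replicas 0 and 1 active. A later disagreement between those two raises `NoMajority`, which mirrors the DMR case in which a mismatch is detected but cannot be resolved by voting. Because `majority_vote` accepts any number of outputs, the same sketch covers larger replication counts.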
See also
* Hot spare