ECC memory
   HOME

TheInfoList



OR:

Error correction code memory (ECC memory) is a type of
computer data storage Computer data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers. The central processing unit (CPU) of a comput ...
that uses an
error correction code In computing, telecommunication, information theory, and coding theory, an error correction code, sometimes error correcting code, (ECC) is used for controlling errors in data over unreliable or noisy communication channels. The central idea ...
(ECC) to detect and correct n-bit
data corruption In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted. ...
which occurs in memory. ECC memory is used in most computers where data corruption cannot be tolerated, like industrial control applications, critical databases, and infrastructural memory caches. Typically, ECC memory maintains a memory system immune to single-bit errors: the data that is read from each
word A word is a basic element of language that carries an objective or practical meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of what a word is, there is no conse ...
is always the same as the data that had been written to it, even if one of the bits actually stored has been flipped to the wrong state. Most non-ECC memory cannot detect errors, although some non-ECC memory with parity support allows detection but not correction.


Description

Error correction codes protect against undetected data corruption and are used in computers where such corruption is unacceptable, examples being scientific and financial computing applications, or in database and file servers. ECC can also reduce the number of crashes in multi-user server applications and maximum-availability systems. Electrical or magnetic interference inside a computer system can cause a single bit of
dynamic random-access memory Dynamic random-access memory (dynamic RAM or DRAM) is a type of random-access semiconductor memory that stores each bit of data in a memory cell, usually consisting of a tiny capacitor and a transistor, both typically based on metal-ox ...
(DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to
alpha particle Alpha particles, also called alpha rays or alpha radiation, consist of two protons and two neutrons bound together into a particle identical to a helium-4 nucleus. They are generally produced in the process of alpha decay, but may also be prod ...
s emitted by contaminants in chip packaging material, but research has shown that the majority of one-off
soft error In electronics and computing, a soft error is a type of error where a signal or datum is wrong. Errors may be caused by a defect, usually understood either to be a mistake in design or construction, or a broken component. A soft error is also a ...
s in DRAM chips occur as a result of
background radiation Background radiation is a measure of the level of ionizing radiation present in the environment at a particular location which is not due to deliberate introduction of radiation sources. Background radiation originates from a variety of source ...
, chiefly
neutron The neutron is a subatomic particle, symbol or , which has a neutral (not positive or negative) charge, and a mass slightly greater than that of a proton. Protons and neutrons constitute the atomic nucleus, nuclei of atoms. Since protons and ...
s from
cosmic ray Cosmic rays are high-energy particles or clusters of particles (primarily represented by protons or atomic nuclei) that move through space at nearly the speed of light. They originate from the Sun, from outside of the Solar System in our own ...
secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them. Hence, the error rates increase rapidly with rising altitude; for example, compared to sea level, the rate of
neutron flux The neutron flux, φ, is a scalar quantity used in nuclear physics and nuclear reactor physics. It is the total length travelled by all free neutrons per unit time and volume. Equivalently, it can be defined as the number of neutrons travellin ...
is 3.5 times higher at 1.5 km and 300 times higher at 10-12 km (the cruising altitude of commercial airplanes).A Survey of Techniques for Modeling and Improving Reliability of Computing Systems
, IEEE TPDS, 2015
As a result, systems operating at high altitudes require special provisions for reliability. As an example, the spacecraft ''
Cassini–Huygens ''Cassini–Huygens'' ( ), commonly called ''Cassini'', was a space-research mission by NASA, the European Space Agency (ESA), and the Italian Space Agency (ASI) to send a space probe to study the planet Saturn and its system, including its r ...
'', launched in 1997, contained two identical flight recorders, each with 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. Due to built-in EDAC functionality, the spacecraft's engineering telemetry reported the number of (correctable) single-bit-per-word errors and (uncorrectable) double-bit-per-word errors. During the first 2.5 years of flight, the spacecraft reported a nearly constant single-bit error rate of about 280 errors per day. However, on November 6, 1997, during the first month in space, the number of errors increased by more than a factor of four on that single day. This was attributed to a solar particle event that had been detected by the satellite GOES 9. There was some concern that as DRAM density increases further, and thus the components on chips get smaller, while operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently, since lower-energy particles will be able to change a memory cell's state. On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI may make individual cells less susceptible and so counteract, or even reverse, this trend. Recent studies show that single-event upsets due to cosmic radiation have been dropping dramatically with process geometry and previous concerns over increasing bit cell error rates are unfounded.


Research

Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude difference, ranging from 10−10 error/bit·h (roughly one bit error per hour per gigabyte of memory) to 10−17 error/bit·h (roughly one bit error per millennium per gigabyte of memory). A large-scale study based on
Google Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
's very large number of servers was presented at the SIGMETRICS/Performance '09 conference. The actual error rate found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between 25,000 (2.5 × 10−11 error/bit·h) and 70,000 (7.0 × 10−11 error/bit·h, or 1 bit error per gigabyte of RAM per 1.8 hours) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year. The consequence of a memory error is system-dependent. In systems without ECC, an error can lead either to a crash or to corruption of data; in large-scale production sites, memory errors are one of the most-common hardware causes of machine crashes. Memory errors can cause security vulnerabilities. A memory error can have no consequences if it changes a bit which neither causes observable malfunctioning nor affects data used in calculations or saved. A 2010 simulation study showed that, for a web browser, only a small fraction of memory errors caused data corruption, although, as many memory errors are intermittent and correlated, the effects of memory errors were greater than would be expected for independent soft errors. Some tests conclude that the isolation of
DRAM Dynamic random-access memory (dynamic RAM or DRAM) is a type of random-access semiconductor memory that stores each bit of data in a memory cell, usually consisting of a tiny capacitor and a transistor, both typically based on metal-oxid ...
memory cells can be circumvented by unintended side effects of specially crafted accesses to adjacent cells. Thus, accessing data stored in DRAM causes memory cells to leak their charges and interact electrically, as a result of high cell density in modern memory, altering the content of nearby memory rows that actually were not addressed in the original memory access. This effect is known as
row hammer Row hammer (also written as rowhammer) is a security exploit that takes advantage of an unintended and undesirable side effect in dynamic random-access memory (DRAM) in which memory cells interact electrically between themselves by leaking thei ...
, and it has also been used in some
privilege escalation Privilege escalation is the act of exploiting a bug, a design flaw, or a configuration oversight in an operating system or software application to gain elevated access to resources that are normally protected from an application or user. The re ...
computer security exploits. An example of a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking, or would be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is loaded, and the character "8" (decimal value 56 in the ASCII encoding) is stored in the byte that contains the stuck bit at its lowest bit position; then, a change is made to the spreadsheet and it is saved. As a result, the "8" (0011 1000 binary) has silently become a "9" (0011 1001).


Solutions

Several approaches have been developed to deal with unwanted bit-flips, including immunity-aware programming, RAM parity memory, and ECC memory. This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record
parity Parity may refer to: * Parity (computing) ** Parity bit in computing, sets the parity of data for the purpose of error detection ** Parity flag in computing, indicates if the number of set bits is odd or even in the binary representation of the ...
or to use an
error-correcting code In computing, telecommunication, information theory, and coding theory, an error correction code, sometimes error correcting code, (ECC) is used for controlling errors in data over unreliable or noisy communication channels. The central idea i ...
(ECC). Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits). The most-common error correcting code, a
single-error correction and double-error detection In computer science and telecommunication, Hamming codes are a family of linear error-correcting codes. Hamming codes can detect one-bit and two-bit errors, or correct one-bit errors without detection of uncorrected errors. By contrast, the sim ...
(SECDED)
Hamming code In computer science and telecommunication, Hamming codes are a family of linear error-correcting codes. Hamming codes can detect one-bit and two-bit errors, or correct one-bit errors without detection of uncorrected errors. By contrast, the sim ...
, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected.
Chipkill __NOTOC__ Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology that protects computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of ...
ECC is a more effective version that also corrects for multiple bit errors, including the loss of an entire memory chip.


Implementations

Seymour Cray Seymour Roger Cray (September 28, 1925 – October 5, 1996
) was an American
parity is for farmers" when asked why he left this out of the
CDC 6600 The CDC 6600 was the flagship of the 6000 series of mainframe computer systems manufactured by Control Data Corporation. Generally considered to be the first successful supercomputer, it outperformed the industry's prior recordholder, the IBM ...
. Later, he included parity in the CDC 7600, which caused pundits to remark that "apparently a lot of farmers buy computers". The original
IBM PC The IBM Personal Computer (model 5150, commonly known as the IBM PC) is the first microcomputer released in the IBM PC model line and the basis for the IBM PC compatible de facto standard. Released on August 12, 1981, it was created by a team ...
and all PCs until the early 1990s used parity checking. Later ones mostly did not. An ECC-capable memory controller can generally detect and correct errors of a single bit per
word A word is a basic element of language that carries an objective or practical meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of what a word is, there is no conse ...
(the unit of bus transfer), and detect (but not correct) errors of two bits per word. The
BIOS In computing, BIOS (, ; Basic Input/Output System, also known as the System BIOS, ROM BIOS, BIOS ROM or PC BIOS) is firmware used to provide runtime services for operating systems and programs and to perform hardware initialization during the b ...
in some computers, when matched with operating systems such as some versions of
Linux Linux ( or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, whi ...
, BSD, and
Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for se ...
(
Windows 2000 Windows 2000 is a major release of the Windows NT operating system developed by Microsoft and oriented towards businesses. It was the direct successor to Windows NT 4.0, and was released to manufacturing on December 15, 1999, and was offici ...
and later), allows counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic. Some DRAM chips include "internal" on-chip error correction circuits, which allow systems with non-ECC memory controllers to still gain most of the benefits of ECC memory. A. H. Johnston
"Space Radiation Effects in Advanced Flash Memories"
. NASA Electronic Parts and Packaging Program (NEPP). 2001.
In some systems, a similar effect may be achieved by using
EOS memory EOS memory (for ECC on SIMMs) is an error-correcting memory system ''built into'' SIMMs, used to upgrade server-class computers without built-in ECC memory support. The EOS SIMM itself does the error checking, with reduced need for ECC memory Er ...
modules.
Error detection and correction In information theory and coding theory with applications in computer science and telecommunication, error detection and correction (EDAC) or error control are techniques that enable reliable delivery of digital data over unreliable commu ...
depends on an expectation of the kinds of errors that occur. Implicitly, it is assumed that the failure of each bit in a word of memory is independent, resulting in improbability of two simultaneous errors. This used to be the case when memory chips were one-bit wide, what was typical in the first half of the 1980s; later developments moved many bits into the same chip. This weakness is addressed by various technologies, including IBM's
Chipkill __NOTOC__ Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology that protects computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of ...
,
Sun Microsystems Sun Microsystems, Inc. (Sun for short) was an American technology company that sold computers, computer components, software, and information technology services and created the Java programming language, the Solaris operating system, ZFS, t ...
' Extended ECC,
Hewlett Packard The Hewlett-Packard Company, commonly shortened to Hewlett-Packard ( ) or HP, was an American multinational information technology company headquartered in Palo Alto, California. HP developed and provided a wide variety of hardware components ...
's Chipspare, and
Intel Intel Corporation is an American multinational corporation and technology company headquartered in Santa Clara, California. It is the world's largest semiconductor chip manufacturer by revenue, and is one of the developers of the x86 ser ...
's
Single Device Data Correction Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel. The redundancy (duplication) allows error detection and error correction: the output from lockstep operations can be compared ...
(SDDC).
DRAM Dynamic random-access memory (dynamic RAM or DRAM) is a type of random-access semiconductor memory that stores each bit of data in a memory cell, usually consisting of a tiny capacitor and a transistor, both typically based on metal-oxid ...
memory may provide increased protection against
soft error In electronics and computing, a soft error is a type of error where a signal or datum is wrong. Errors may be caused by a defect, usually understood either to be a mistake in design or construction, or a broken component. A soft error is also a ...
s by relying on error correcting codes. Such error-correcting memory, known as ''ECC'' or ''EDAC-protected'' memory, is particularly desirable for high fault-tolerant applications, such as servers, as well as deep-space applications due to increased
radiation In physics, radiation is the emission or transmission of energy in the form of waves or particles through space or through a material medium. This includes: * ''electromagnetic radiation'', such as radio waves, microwaves, infrared, visi ...
. Some systems also " scrub" the memory, by periodically reading all addresses and writing back corrected versions if necessary to remove soft errors. Interleaving allows for distribution of the effect of a single cosmic ray, potentially upsetting multiple physically neighboring bits across multiple words by associating neighboring bits to different words. As long as a single event upset (SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error correcting code), and an effectively error-free memory system may be maintained. Error-correcting memory controllers traditionally use
Hamming code In computer science and telecommunication, Hamming codes are a family of linear error-correcting codes. Hamming codes can detect one-bit and two-bit errors, or correct one-bit errors without detection of uncorrected errors. By contrast, the sim ...
s, although some use
triple modular redundancy Triple is used in several contexts to mean "threefold" or a " treble": Sports * Triple (baseball), a three-base hit * A basketball three-point field goal * A figure skating jump with three rotations * In bowling terms, three strikes in a row * I ...
(TMR). The latter is preferred because its hardware is faster than that of Hamming error correction scheme. Space satellite systems often use TMR, although satellite RAM usually uses Hamming error correction. Many early implementations of ECC memory mask correctable errors, acting "as if" the error never occurred, and only report uncorrectable errors. Modern implementations log both correctable errors (CE) and uncorrectable errors (UE). Some people proactively replace memory modules that exhibit high error rates, in order to reduce the likelihood of uncorrectable error events. Many ECC memory systems use an "external" EDAC circuit between the CPU and the memory. A few systems with ECC memory use both internal and external EDAC systems; the external EDAC system should be designed to correct certain errors that the internal EDAC system is unable to correct. Modern desktop and server CPUs integrate the EDAC circuit into the CPU, even before the shift toward CPU-integrated memory controllers, which are related to the
NUMA Nuclear mitotic apparatus protein 1 is a protein that in humans is encoded by the ''NUMA1'' gene. Interactions Nuclear mitotic apparatus protein 1 has been shown to interact with PIM1, Band 4.1, GPSM2 G-protein-signaling modulator 2, also ca ...
architecture. CPU integration enables a zero-penalty EDAC system during error-free operation. As of 2009, the most-common error-correction codes use Hamming or Hsiao codes that provide single-bit error correction and double-bit error detection (SEC-DED). Other error-correction codes have been proposed for protecting memory double-bit error correcting and triple-bit error detecting (DEC-TED) codes, single-nibble error correcting and double-nibble error detecting (SNC-DND) codes,
Reed–Solomon error correction Reed–Solomon codes are a group of error-correcting codes that were introduced by Irving S. Reed and Gustave Solomon in 1960. They have many applications, the most prominent of which include consumer technologies such as MiniDiscs, CDs, DVDs, ...
codes, etc. However, in practice, multi-bit correction is usually implemented by interleaving multiple SEC-DED codes.Doe Hyun Yoon; Mattan Erez
"Memory Mapped ECC: Low-Cost Error Protection for Last Level Caches"
2009. p. 3
Early research attempted to minimize the area and delay overheads of ECC circuits. Hamming first demonstrated that SEC-DED codes were possible with one particular check matrix. Hsiao showed that an alternative matrix with odd weight columns provides SEC-DED capability with less hardware area and shorter delay than traditional Hamming SEC-DED codes. More recent research also attempts to minimize power in addition to minimizing area and delay.


Cache

Many CPUs use error-correction codes in the on-chip cache, including the Intel
Itanium Itanium ( ) is a discontinued family of 64-bit Intel microprocessors that implement the Intel Itanium architecture (formerly called IA-64). Launched in June 2001, Intel marketed the processors for enterprise servers and high-performance comput ...
,
Xeon Xeon ( ) is a brand of x86 microprocessors designed, manufactured, and marketed by Intel, targeted at the non-consumer workstation, server, and embedded system markets. It was introduced in June 1998. Xeon processors are based on the same ar ...
, Core and
Pentium Pentium is a brand used for a series of x86 architecture-compatible microprocessors produced by Intel. The original Pentium processor from which the brand took its name was first released on March 22, 1993. After that, the Pentium II and P ...
(since P6 microarchitecture) processors, the AMD
Athlon Athlon is the brand name applied to a series of x86-compatible microprocessors designed and manufactured by Advanced Micro Devices (AMD). The original Athlon (now called Athlon Classic) was the first seventh-generation x86 processor and the fi ...
,
Opteron Opteron is AMD's x86 former server and workstation processor line, and was the first processor which supported the AMD64 instruction set architecture (known generically as x86-64 or AMD64). It was released on April 22, 2003, with the ''Sledg ...
, all
Zen Zen ( zh, t=禪, p=Chán; ja, text= 禅, translit=zen; ko, text=선, translit=Seon; vi, text=Thiền) is a school of Mahayana Buddhism that originated in China during the Tang dynasty, known as the Chan School (''Chánzong'' 禪宗), and ...
- and
Zen+ Zen ( zh, t=禪, p=Chán; ja, text= 禅, translit=zen; ko, text=선, translit=Seon; vi, text=Thiền) is a school of Mahayana Buddhism that originated in China during the Tang dynasty, known as the Chan School (''Chánzong'' 禪宗), and ...
-based processors (
EPYC Epyc is a brand of multi-core x86-64 microprocessors designed and sold by AMD, based on the company's Zen microarchitecture. Introduced in June 2017, they are specifically targeted for the server and embedded system markets. Epyc processors share ...
, EPYC Embedded,
Ryzen Ryzen ( ) is a brand of multi-core x86-64 microprocessors designed and marketed by AMD for desktop, mobile, server, and embedded platforms based on the Zen microarchitecture. It consists of central processing units (CPUs) marketed for mainst ...
and Ryzen Threadripper), and the DEC Alpha 21264. , EDC/ECC and ECC/ECC are the two most-common cache error-protection techniques used in commercial microprocessors. The EDC/ECC technique uses an error-detecting code (EDC) in the level 1 cache. If an error is detected, data is recovered from ECC-protected level 2 cache. The ECC/ECC technique uses an ECC-protected level 1 cache and an ECC-protected level 2 cache. CPUs that use the EDC/ECC technique always write-through all STOREs to the level 2 cache, so that when an error is detected during a read from the level 1 data cache, a copy of that data can be recovered from the level 2 cache.


Registered memory

Registered, or buffered, memory is not the same as ECC; the technologies perform different functions. It is usual for memory used in servers to be both registered, to allow many memory modules to be used without electrical problems, and ECC, for data integrity. Memory used in desktop computers is usually neither, for economy. However, unbuffered (not-registered) ECC memory is available, and some non-server motherboards support ECC functionality of such modules when used with a CPU that supports ECC. Registered memory does not work reliably in motherboards without buffering circuitry, and vice versa.


Advantages and disadvantages

Ultimately, there is a trade-off between protection against unusual loss of data and a higher cost. ECC memory usually involves a higher price when compared to non-ECC memory, due to additional hardware required for producing ECC memory modules, and due to lower production volumes of ECC memory and associated system hardware. Motherboards, chipsets and processors that support ECC may also be more expensive. ECC support varies among motherboard manufacturers so ECC memory may simply not be recognized by a ECC-incompatible motherboard. Most
motherboard A motherboard (also called mainboard, main circuit board, mb, mboard, backplane board, base board, system board, logic board (only in Apple computers) or mobo) is the main printed circuit board (PCB) in general-purpose computers and other expand ...
s and processors for less critical applications are not designed to support ECC. Some ECC-enabled boards and processors are able to support unbuffered (unregistered) ECC, but will also work with non-ECC memory; system firmware enables ECC functionality if ECC memory is installed. ECC may lower memory performance by around 2–3 percent on some systems, depending on the application and implementation, due to the additional time needed for ECC memory controllers to perform error checking. However, modern systems integrate ECC testing into the CPU, generating no additional delay to memory accesses as long as no errors are detected.Benchmark of AMD-762/Athlon platform with and without ECC
ECC supporting memory may contribute to additional power consumption due to error correcting circuitry.


Notes


References


External links


SoftECC: A System for Software Memory Integrity Checking

A Tunable, Software-based DRAM Error Detection and Correction Library for HPC

Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing

Single-Bit Errors: A Memory Module Supplier's perspective on cause, impact and detection

Intel Xeon Processor E3 - 1200 Product Family Memory Configuration Guide

Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC
{{Primary storage technologies Computer memory Fault-tolerant computer systems