Error correction code memory (ECC memory) is a type of computer data storage that uses an error correction code (ECC) to detect and correct ''n''-bit data corruption which occurs in memory.
Typically, ECC memory maintains a memory system immune to single-bit errors: the data that is read from each word is always the same as the data that had been written to it, even if one of the bits actually stored has been flipped to the wrong state. Most non-ECC memory cannot detect errors, although some non-ECC memory with parity support allows detection but not correction.
ECC memory is used in most computers where data corruption cannot be tolerated, like industrial control applications, critical databases, and infrastructural memory caches.
Concept
Error correction codes protect against undetected data corruption and are used in computers where such corruption is unacceptable, examples being scientific and financial computing applications, or in database and file servers. ECC can also reduce the number of crashes in multi-user server applications and maximum-availability systems.
Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors result mainly from neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them. Hence, the error rates increase rapidly with rising altitude; for example, compared to sea level, the rate of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km (the cruising altitude of commercial airplanes) ("A Survey of Techniques for Modeling and Improving Reliability of Computing Systems", IEEE TPDS, 2015). As a result, systems operating at high altitudes require special provisions for reliability.
As an example, the spacecraft ''Cassini–Huygens'', launched in 1997, contained two identical flight recorders, each with 2.5 gigabits of memory in the form of arrays of commercial DRAM chips. Due to built-in EDAC functionality, the spacecraft's engineering telemetry reported the number of (correctable) single-bit-per-word errors and (uncorrectable) double-bit-per-word errors. During the first 2.5 years of flight, the spacecraft reported a nearly constant single-bit error rate of about 280 errors per day. However, on November 6, 1997, during the first month in space, the number of errors increased by more than a factor of four on that single day. This was attributed to a solar particle event that had been detected by the satellite GOES 9.
There was some concern that as DRAM density increases further, and thus the components on chips get smaller, while operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently, since lower-energy particles will be able to change a memory cell's state. On the other hand, smaller cells make smaller targets, and moves to technologies such as silicon on insulator (SOI) may make individual cells less susceptible and so counteract, or even reverse, this trend. Recent studies show that single-event upsets due to cosmic radiation have been dropping dramatically with process geometry, and previous concerns over increasing bit cell error rates are unfounded.
Research
Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude difference, ranging from roughly one bit error per hour per gigabyte of memory to roughly one bit error per millennium per gigabyte of memory. A large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance '09 conference. The actual error rate found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between 25,000 and 70,000 errors per billion device hours per megabit (the upper figure corresponds to about 1 bit error per gigabyte of RAM per 1.8 hours). More than 8% of DIMM memory modules were affected by errors per year.
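The unit "errors per billion device hours per megabit" is easier to interpret once converted to a mean time between errors for one gigabyte of RAM. The following is a minimal sketch of that arithmetic; the function name is illustrative, and it assumes decimal units (1 GB = 8,000 Mbit), which is the assumption under which the 1.8-hour figure is reproduced.

```python
# Assumption: decimal units, so one gigabyte holds 8,000 megabits.
MBIT_PER_GIGABYTE = 8 * 1000


def hours_between_errors_per_gb(errors_per_billion_hours_per_mbit: float) -> float:
    """Mean number of hours between bit errors in one gigabyte of RAM."""
    errors_per_hour_per_mbit = errors_per_billion_hours_per_mbit / 1e9
    errors_per_hour_per_gb = errors_per_hour_per_mbit * MBIT_PER_GIGABYTE
    return 1.0 / errors_per_hour_per_gb


low_rate = hours_between_errors_per_gb(25_000)   # about 5 hours
high_rate = hours_between_errors_per_gb(70_000)  # about 1.8 hours
print(f"{low_rate:.1f} h and {high_rate:.1f} h between errors per GB")
```

Under these assumptions, the lower bound of 25,000 corresponds to about one bit error per gigabyte every five hours.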
The consequence of a memory error is system-dependent. In systems without ECC, an error can lead either to a crash or to corruption of data; in large-scale production sites, memory errors are one of the most-common hardware causes of machine crashes. Memory errors can cause security vulnerabilities. A memory error can have no consequences if it changes a bit which neither causes observable malfunctioning nor affects data used in calculations or saved. A 2010 simulation study showed that, for a web browser, only a small fraction of memory errors caused data corruption, although, as many memory errors are intermittent and correlated, the effects of memory errors were greater than would be expected for independent soft errors.
Some tests conclude that the isolation of DRAM memory cells can be circumvented by unintended side effects of specially crafted accesses to adjacent cells. Thus, accessing data stored in DRAM causes memory cells to leak their charges and interact electrically, as a result of high cell density in modern memory, altering the content of nearby memory rows that actually were not addressed in the original memory access. This effect is known as row hammer, and it has also been used in some privilege escalation computer security exploits.
The following is an example of a single-bit error that would be ignored by a system with no error checking, would halt a machine with parity checking, or would be invisibly corrected by ECC. A single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation. A spreadsheet storing numbers in ASCII format is loaded, and the character "8" (decimal value 56 in the ASCII encoding) is stored in the byte that contains the stuck bit at its lowest bit position. Then a change is made to the spreadsheet and it is saved. As a result, the "8" (0011 1000 binary) has silently become a "9" (0011 1001).
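The spreadsheet scenario can be reproduced in a few lines. This is only an illustrative sketch: the helper names are hypothetical, and the stuck bit is modeled by forcing bit 0 of the stored byte to 1. It also shows why a parity-checking machine would notice the change: the parity recorded at write time no longer matches the parity of the byte read back.

```python
def force_bit_to_one(byte: int, position: int) -> int:
    """Model a memory cell stuck at 1 at the given bit position."""
    return byte | (1 << position)


def parity(byte: int) -> int:
    """Return 1 if the byte has an odd number of 1 bits, else 0."""
    return bin(byte).count("1") % 2


written = ord("8")                       # 56, binary 0011 1000
stored_parity = parity(written)          # parity recorded at write time
read_back = force_bit_to_one(written, 0)

print(chr(written), format(written, "08b"))
print(chr(read_back), format(read_back, "08b"))
print("parity mismatch:", parity(read_back) != stored_parity)
```

Without any check, the program simply sees the character "9"; with parity, the mismatch is detected but the original value cannot be recovered.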
Solutions
Several approaches have been developed to deal with unwanted bit-flips, including immunity-aware programming, RAM parity memory, and ECC memory.
This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to use an error-correcting code (ECC). Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits). The most common error-correcting code, a single-error correction and double-error detection (SECDED) Hamming code, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected. Chipkill ECC is a more effective version that also corrects for multiple bit errors, including the loss of an entire memory chip.
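To make the SECDED idea concrete, the following is a toy sketch of an extended Hamming code that protects only 4 data bits with 4 check bits. Real memory controllers typically protect much wider words and are implemented in hardware; the function names and the bit layout here are assumptions chosen for readability.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the extra overall parity bit. Indices 1..7 hold a
    Hamming(7,4) codeword with check bits at positions 1, 2 and 4.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Check bit p makes the parity of all positions with bit p set even.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data), correcting one flipped bit or flagging two."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i          # XOR of the positions of all set bits
    overall_odd = sum(code) % 2    # 1 if an odd number of bits flipped
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Single-bit error at position `syndrome` (0 = overall parity bit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even overall parity: two bits flipped.
        return "uncorrectable", None
    data = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return status, data


word = encode(0b1011)
word[6] ^= 1                       # one flipped bit is corrected
print(decode(word))
word[2] ^= 1                       # a second flipped bit is only detected
print(decode(word))
```

The extra overall parity bit is what separates the two cases: a single error makes the overall parity odd, while a double error leaves it even but produces a non-zero syndrome.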
Implementations
Seymour Cray said "parity is for farmers" when asked why he left this out of the CDC 6600. Later, he included parity in the CDC 7600, which caused pundits to remark that "apparently a lot of farmers buy computers". The original IBM PC and all PCs until the early 1990s used parity checking. Later ones mostly did not.
An ECC-capable memory controller can generally detect and correct errors of a single bit per word (the unit of bus transfer), and detect (but not correct) errors of two bits per word. The BIOS in some computers, when matched with operating systems such as some versions of Linux, BSD, and Windows (Windows 2000 and later), allows counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic.
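As one illustration of such counting, Linux kernels with an EDAC driver loaded for the memory controller expose per-controller counters under sysfs. The sketch below reads the `ce_count` (corrected errors) and `ue_count` (uncorrected errors) files; the function name is hypothetical, and on machines without an EDAC driver the directory may not exist, in which case the function returns an empty result.

```python
from pathlib import Path
from typing import Dict, Tuple


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> Dict[str, Tuple[int, int]]:
    """Return {controller: (corrected, uncorrected)} error counts."""
    base = Path(root)
    counts: Dict[str, Tuple[int, int]] = {}
    if not base.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for controller in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((controller / "ce_count").read_text())
            uncorrected = int((controller / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[controller.name] = (corrected, uncorrected)
    return counts


for name, (corrected, uncorrected) in read_edac_counts().items():
    print(f"{name}: {corrected} corrected, {uncorrected} uncorrected")
```

A steadily rising corrected-error count on one controller is the kind of signal that can prompt replacing a module before an uncorrectable error occurs.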
Some DRAM chips include "internal" on-chip error-correction circuits, which allow systems with non-ECC memory controllers to still gain most of the benefits of ECC memory (A. H. Johnston, "Space Radiation Effects in Advanced Flash Memories", NASA Electronic Parts and Packaging Program (NEPP), 2001). In some systems, a similar effect may be achieved by using EOS memory modules.
Error detection and correction depends on an expectation of the kinds of errors that occur. Implicitly, it is assumed that the failure of each bit in a word of memory is independent, which makes two simultaneous errors improbable. This used to be the case when memory chips were one bit wide, which was typical in the first half of the 1980s; later developments moved many bits into the same chip. This weakness is addressed by various technologies, including IBM's Chipkill, comparable technologies from Sun Microsystems and Hewlett-Packard, and Intel's Single Device Data Correction (SDDC).
DRAM memory may provide increased protection against soft errors by relying on error-correcting codes. Such error-correcting memory, known as ''ECC'' or ''EDAC-protected'' memory, is particularly desirable for highly fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation. Some systems also "scrub" the memory, by periodically reading all addresses and writing back corrected versions if necessary to remove soft errors.
Interleaving distributes the effect of a single cosmic ray, which could upset multiple physically neighboring bits, across multiple words by assigning neighboring bits to different words. As long as a single-event upset (SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error-correcting code), and an effectively error-free memory system may be maintained.
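The effect of interleaving can be seen by counting how many flipped bits land in each word. This is a simplified sketch with a hypothetical layout: a row of 4 words of 8 bits each, where interleaving assigns physical bit `b` to word `b % 4` instead of word `b // 8`.

```python
from collections import Counter
from typing import Dict

BITS_PER_WORD = 8  # hypothetical word size for this illustration
NUM_WORDS = 4      # hypothetical number of words sharing one physical row


def word_of_bit(physical_bit: int, interleaved: bool) -> int:
    """Map a physical bit position in the row to the word that owns it."""
    if interleaved:
        return physical_bit % NUM_WORDS
    return physical_bit // BITS_PER_WORD


def errors_per_word(first_bit: int, burst_length: int, interleaved: bool) -> Dict[int, int]:
    """Count flipped bits per word for a burst of adjacent physical bits."""
    hits = Counter(
        word_of_bit(bit, interleaved)
        for bit in range(first_bit, first_bit + burst_length)
    )
    return dict(hits)


# A particle strike upsets physical bits 5, 6 and 7.
print(errors_per_word(5, 3, interleaved=False))  # all three hit one word
print(errors_per_word(5, 3, interleaved=True))   # one hit in each of three words
```

Without interleaving, the three upsets fall in one word and exceed what a single-error-correcting code can repair; with interleaving, each affected word sees only one error.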
Error-correcting memory controllers traditionally use Hamming codes, although some use triple modular redundancy (TMR). Where it is used, TMR is preferred because its hardware is faster than that of the Hamming error-correction scheme. Space satellite systems often use TMR, although satellite RAM usually uses Hamming error correction.
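The speed advantage of TMR comes from the simplicity of its voter: each output bit is just the majority of three stored copies. The sketch below shows a bitwise majority vote over integers; the function name is illustrative, and hardware voters apply the same logic per bit with a few gates.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Return, for every bit position, the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


stored = 0b10110010
copy_a = stored ^ 0b00001000   # one copy suffers a bit flip
copy_b = stored ^ 0b01000000   # another copy flips a different bit
copy_c = stored                # the third copy is intact

recovered = majority_vote(copy_a, copy_b, copy_c)
print(format(recovered, "08b"))
```

The vote recovers the stored value as long as no bit position is wrong in more than one copy, at the cost of tripling the storage.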
Many early implementations of ECC memory mask correctable errors, acting "as if" the error never occurred, and only report uncorrectable errors. Modern implementations log both correctable errors (CE) and uncorrectable errors (UE). Some people proactively replace memory modules that exhibit high error rates, in order to reduce the likelihood of uncorrectable error events.
Many ECC memory systems use an "external" EDAC circuit between the CPU and the memory. A few systems with ECC memory use both internal and external EDAC systems; the external EDAC system should be designed to correct certain errors that the internal EDAC system is unable to correct. Modern desktop and server CPUs integrate the EDAC circuit into the CPU, a change that began even before the shift toward CPU-integrated memory controllers, which are related to the NUMA architecture. CPU integration enables a zero-penalty EDAC system during error-free operation.
As of 2009, the most common error-correction codes use Hamming or Hsiao codes that provide single-bit error correction and double-bit error detection (SEC-DED). Other error-correction codes have been proposed for protecting memory: double-bit error correcting and triple-bit error detecting (DEC-TED) codes, single-nibble error correcting and double-nibble error detecting (SNC-DND) codes, Reed–Solomon error correction codes, etc. However, in practice, multi-bit correction is usually implemented by interleaving multiple SEC-DED codes (Doe Hyun Yoon; Mattan Erez, "Memory Mapped ECC: Low-Cost Error Protection for Last Level Caches", 2009, p. 3).
Early research attempted to minimize the area and delay overheads of ECC circuits. Hamming first demonstrated that SEC-DED codes were possible with one particular check matrix. Hsiao showed that an alternative matrix with odd-weight columns provides SEC-DED capability with less hardware area and shorter delay than traditional Hamming SEC-DED codes. More recent research also attempts to minimize power in addition to minimizing area and delay.
Cache
Many CPUs use error-correction codes in the on-chip cache, including the Intel Itanium, Xeon, Core and Pentium processors; the AMD Athlon, Opteron, and all Zen- and Zen+-based processors (EPYC, Ryzen and Ryzen Threadripper); and the DEC Alpha 21264.
EDC/ECC and ECC/ECC are the two most common cache error-protection techniques used in commercial microprocessors. The EDC/ECC technique uses an error-detecting code (EDC) in the level 1 cache. If an error is detected, data is recovered from the ECC-protected level 2 cache. The ECC/ECC technique uses an ECC-protected level 1 cache and an ECC-protected level 2 cache. CPUs that use the EDC/ECC technique always write through all STOREs to the level 2 cache, so that when an error is detected during a read from the level 1 data cache, a copy of that data can be recovered from the level 2 cache.
Registered memory
Registered, or buffered, memory is not the same as ECC; the technologies perform different functions. It is usual for memory used in servers to be both registered, to allow many memory modules to be used without electrical problems, and ECC, for data integrity.
Advantages and disadvantages
Ultimately, there is a trade-off between protection against unusual loss of data and a higher cost.
ECC memory usually costs more than non-ECC memory, due to additional hardware required for producing ECC memory modules, and due to lower production volumes of ECC memory and associated system hardware. Motherboards, chipsets and processors that support ECC may also be more expensive.
ECC support varies among motherboard manufacturers, so ECC memory may simply not be recognized by an ECC-incompatible motherboard. Most motherboards and processors for less critical applications are not designed to support ECC. Some ECC-enabled boards and processors are able to support unbuffered (unregistered) ECC, but will also work with non-ECC memory; system firmware enables ECC functionality if ECC memory is installed.
ECC may lower memory performance by around 2–3 percent on some systems, depending on the application and implementation, due to the additional time needed for ECC memory controllers to perform error checking. However, modern systems integrate ECC testing into the CPU, generating no additional delay to memory accesses as long as no errors are detected.
This is not the case for in-band ECC, which stores the tables used for protection in a reserved region of main system memory. In-band ECC is supported by Intel for Chromebooks, where it showed little impact on web browsing and productivity tasks but caused up to a 25% reduction in gaming and video editing benchmarks.
ECC-supporting memory may also consume additional power because of its error-correcting circuitry.