A machine check exception (MCE) is a type of
computer error that occurs when a problem involving the computer's
hardware is detected. With most mass-market personal computers, an MCE indicates faulty or misconfigured hardware.
The nature and causes of MCEs can vary by
architecture
and generation of system. In some designs, an MCE is always an unrecoverable error that halts the machine and requires a
reboot
. In other architectures, some MCEs may be non-fatal, such as for single-bit errors corrected by
ECC memory
. On some architectures, such as
PowerPC
, certain software bugs can cause MCEs, such as an invalid memory access. On other architectures, such as
x86, MCEs typically originate from hardware only.
Reporting
Microsoft Windows
On
Microsoft Windows platforms, in the event of an unrecoverable MCE, the system generates a BugCheck, also called a STOP error or a
Blue Screen of Death.
More recent versions of Windows use the
Windows Hardware Error Architecture (WHEA), and generate STOP code 0x124, WHEA_UNCORRECTABLE_ERROR. The four parameters (in parentheses) vary, but the first is always 0x0 for an MCE. Example:
STOP: 0x00000124 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)
Older versions of Windows use the
Machine Check Architecture
, with STOP code 0x9C, MACHINE_CHECK_EXCEPTION. Example:
STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA)
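As a rough illustration of how the two STOP formats above could be told apart programmatically, the following Python sketch parses a STOP line into its code and parameters. The function names `parse_stop` and `is_machine_check` are hypothetical, and the name table covers only the two codes mentioned in this section; the rule that a 0x124 STOP is an MCE when its first parameter is 0x0 follows the description above.

```python
import re

# Only the two STOP codes discussed in this section; anything else is "unknown".
STOP_NAMES = {
    0x124: "WHEA_UNCORRECTABLE_ERROR",
    0x9C: "MACHINE_CHECK_EXCEPTION",
}

# Matches lines of the form "STOP: 0x... (0x..., 0x..., ...)".
STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop(line):
    """Return (code, params, name) for a STOP line, or None if it does not match."""
    match = STOP_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    return code, params, STOP_NAMES.get(code, "unknown")


def is_machine_check(code, params):
    """0x9C always denotes an MCE; 0x124 does when the first parameter is 0x0."""
    if code == 0x9C:
        return True
    return code == 0x124 and bool(params) and params[0] == 0x0


line = "STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA)"
code, params, name = parse_stop(line)
print(hex(code), name, is_machine_check(code, params))
```

The remaining parameters are platform-specific and are left undecoded here.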
Linux
On
Linux
, the
kernel
writes messages about MCEs to the kernel message log and the
system console
. When the MCEs are not fatal, they will also typically be copied to the
system log
and/or
systemd journal. For some systems, ECC and other correctable errors may be reported through MCE facilities.
Example:
CPU 0: Machine Check Exception: 0000000000000004
Bank 2: f200200000000863
Kernel panic: CPU context corrupt
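A log excerpt like the one above can be picked apart with a few lines of Python. The sketch below is illustrative only: it assumes the exact line format shown in the example (which varies between kernel versions), assumes that the value after "Machine Check Exception:" is the x86 global machine-check status register (IA32_MCG_STATUS), and uses that register's low three flag bits (RIPV, EIPV, MCIP). The helper names `parse_mce_log` and `mcg_flags` are hypothetical.

```python
import re

# Assumed flag bits of IA32_MCG_STATUS: RIPV = bit 0, EIPV = bit 1, MCIP = bit 2.
MCG_FLAGS = ((0, "RIPV"), (1, "EIPV"), (2, "MCIP"))

HEADER_RE = re.compile(r"CPU (\d+): Machine Check Exception:\s*([0-9A-Fa-f]+)")
BANK_RE = re.compile(r"Bank (\d+):\s*([0-9A-Fa-f]+)")


def mcg_flags(value):
    """Return the names of the flag bits set in a global status value."""
    return [name for bit, name in MCG_FLAGS if (value >> bit) & 1]


def parse_mce_log(text):
    """Collect the CPU number, global status, and per-bank status values."""
    report = {"cpu": None, "mcg_status": None, "banks": {}}
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            report["cpu"] = int(header.group(1))
            report["mcg_status"] = int(header.group(2), 16)
            continue
        bank = BANK_RE.search(line)
        if bank:
            report["banks"][int(bank.group(1))] = int(bank.group(2), 16)
    return report


SAMPLE = """CPU 0: Machine Check Exception: 0000000000000004
Bank 2: f200200000000863
Kernel panic: CPU context corrupt"""

report = parse_mce_log(SAMPLE)
print(report["cpu"], mcg_flags(report["mcg_status"]), sorted(report["banks"]))
```

Under these assumptions, the sample reports a machine check in progress (MCIP) on CPU 0 with RIPV clear, meaning the interrupted program cannot be reliably restarted, which is consistent with the kernel panic on the last line.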
Problem types
Most of these errors relate specifically to the
Pentium
processor family. Similar errors may occur on other processors and will cause similar problems.
Some of the main hardware problems that cause MCEs include:
*
System bus
errors (errors in communication between the processor and the
motherboard
).
*
Memory
errors:
parity checking
detects when a memory error has occurred.
Error correction code
(ECC) can correct limited memory errors so that processing can continue.
*
CPU cache
errors in the processor.
Possible causes
On x86, machine checks generally indicate a hardware problem rather than a software problem. They are often the result of
overclocking
or overheating. In some cases, the CPU shuts itself off after passing a thermal limit to avoid permanent damage. Machine checks can also be caused by bus errors introduced by other failing components, such as memory or I/O devices. Possible causes include:
* Poor CPU cooling due to a CPU heatsink or case fans (or filters) that are clogged with dust or have come loose.
*
Overclocking
beyond the highest clock rate at which the CPU is still reliable.
* Failing motherboard.
* Failing processor.
* Failing memory.
* Failing I/O controllers, on either the motherboard or separate cards.
* Failing I/O devices.
* Inadequate or failing power supply.
Cooling problems are usually obvious upon inspection. A failing motherboard or processor can be identified by swapping them with functioning parts. Memory can be checked by booting to a diagnostic tool, like
memtest86. Non-essential failing I/O devices and controllers can be identified by unplugging or disabling them, where possible, to see whether the problem disappears. If failures tend to occur either shortly after the OS boots or else not for days (or not at all), a power supply problem may be the cause. With a power supply problem, the failure often occurs when power demand peaks as the OS starts up external devices for use.
Decoding MCEs
For IA-32 and Intel 64 processors, consult the Intel 64 and IA-32 Architectures Software Developer's Manual Chapter 15 (Machine-Check Architecture), or the Microsoft KB Article on Windows Exceptions.
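As a minimal sketch of what such decoding involves, the Python snippet below splits a 64-bit bank status value (IA32_MCi_STATUS) into its architecturally defined flag bits and error-code fields. The bit positions used here (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, MCA error code in bits 15:0, model-specific error code in bits 31:16) are taken from the Intel manual's general layout and should be checked against the edition for the processor in question; the function name `decode_mci_status` is hypothetical, and the remaining bits, which are model-specific or reserved, are ignored.

```python
# Assumed architectural flag bits of IA32_MCi_STATUS, highest bit first.
STATUS_FLAGS = (
    (63, "VAL"),    # register contents are valid
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # the MISC register holds extra information
    (58, "ADDRV"),  # the ADDR register holds a valid address
    (57, "PCC"),    # processor context corrupt
)


def decode_mci_status(status):
    """Split a bank status value into flag names and error-code fields."""
    return {
        "flags": [name for bit, name in STATUS_FLAGS if (status >> bit) & 1],
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


# Bank 2 value from the Linux example earlier in this article.
decoded = decode_mci_status(0xF200200000000863)
print(decoded["flags"], hex(decoded["mca_error_code"]))
```

For the sample value, the PCC flag is set, which matches the "CPU context corrupt" panic message in the Linux example. Interpreting the 16-bit MCA error code itself requires the encoding tables in the manual and is beyond this sketch.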
Programs to decode Intel and AMD MCEs
* mcat: A Windows command-line program from
AMD to decode MCEs from
AMD K8
, Family
0x10 and
0x11 processors.
* mcelog: a
Linux
daemon by Andi Kleen to handle MCEs for modern x86 processors. mcelog can also decode machine checks.
* parsemce: a
Linux
program by Dave Jones to decode MCEs from
AMD K7 processors.
* mced: a
Linux
program by Tim Hockin to gather MCEs from the kernel and alert interested applications. It does not try to interpret the MCE data; it simply alerts other programs.
See also
*
Machine check architecture
*
Blue screen of death
*
Kernel panic
External links
* mcelog: Advanced hardware error handling for x86 Linux
* parsemce: Linux Machine check exception handler parser