A machine check exception (MCE) is a type of computer error that occurs when a problem involving the computer's hardware is detected. With most mass-market personal computers, an MCE indicates faulty or misconfigured hardware.
The nature and causes of MCEs can vary by architecture and generation of system. In some designs, an MCE is always an unrecoverable error that halts the machine, requiring a reboot. In other architectures, some MCEs may be non-fatal, such as single-bit errors corrected by ECC memory. On some architectures, such as PowerPC, certain software bugs, such as an invalid memory access, can cause MCEs. On other architectures, such as x86, MCEs typically originate from hardware only.
Reporting
IBM mainframe operating systems
IBM System/360 Operating System (OS/360) records input/output errors in a dataset called SYS1.LOGREC. Since then IBM has coined the term error recording data set (ERDS) for successor versions that allow the installation to choose the name and for operating systems not derived from OS/360.
OS/360
In OS/360, the installation can choose several levels of support for handling machine checks. The most sophisticated, Machine Check Handler (MCH), records failure data on SYS1.LOGREC and attempts recovery. The installation can print those data using the Environmental Record Editing and Printing Program (EREP) service aid or the stand-alone version SEREP. The MCH can handle memory failures in refreshable nucleus control sections by reading a fresh copy from SYS1.ASRLIB and can handle memory errors in SVC transient areas by reading a fresh copy of the SVC module from SYS1.SVCLIB.
z/OS
In z/OS the installation can either use an ERDS or can define a z/OS System Logger log stream to hold the error data. As with OS/360, the installation uses EREP to print those data; SEREP is no longer available. The MCH is no longer optional, and handles many more failure modes than the OS/360 MCH.
Microsoft Windows
On Microsoft Windows platforms, in the event of an unrecoverable MCE, the system generates a BugCheck, also called a STOP error or a Blue Screen of Death.
More recent versions of Windows use the Windows Hardware Error Architecture (WHEA), and generate STOP code 0x124, WHEA_UNCORRECTABLE_ERROR. The four parameters (in parentheses) will vary, but the first is always 0x0 for an MCE. Example:
STOP: 0x00000124 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)
Older versions of Windows use the Machine Check Architecture, with STOP code 0x9C, MACHINE_CHECK_EXCEPTION. Example:
STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA)
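The two STOP codes above can be told apart mechanically. The following is a minimal Python sketch, not a Microsoft tool: it assumes the STOP line has the textual shape shown in the examples, and the function names `parse_stop_line` and `describe_stop` are hypothetical. It applies only the two rules stated above (code 0x124 with a first parameter of 0x0 indicates an MCE; code 0x9C is MACHINE_CHECK_EXCEPTION).

```python
import re

# Matches lines shaped like the examples above: "STOP: 0x... (p1, p2, p3, p4)".
STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Return (stop_code, [parameters]) as integers, or None if no match."""
    match = STOP_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    return code, params


def describe_stop(line):
    """Classify a STOP line using only the rules given in the text."""
    parsed = parse_stop_line(line)
    if parsed is None:
        return "not a STOP line"
    code, params = parsed
    if code == 0x124:
        # Per the text, the first parameter is 0x0 when the error is an MCE.
        if params and params[0] == 0x0:
            return "WHEA_UNCORRECTABLE_ERROR (machine check exception)"
        return "WHEA_UNCORRECTABLE_ERROR (not reported as an MCE)"
    if code == 0x9C:
        return "MACHINE_CHECK_EXCEPTION"
    return "other STOP code 0x%X" % code


if __name__ == "__main__":
    print(describe_stop("STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA)"))
```

The remaining parameters are left undecoded here, since their meaning depends on the Windows version and the processor.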
Linux
On Linux, the kernel writes messages about MCEs to the kernel message log and the system console. When the MCEs are not fatal, they will also typically be copied to the system log and/or systemd journal. For some systems, ECC and other correctable errors may be reported through MCE facilities.
Example:
CPU 0: Machine Check Exception: 0000000000000004
Bank 2: f200200000000863
Kernel panic: CPU context corrupt
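The hexadecimal values in such a message can be broken into individual flags. The Python sketch below is illustrative rather than a replacement for a real decoder: it assumes that the number after "Machine Check Exception:" is the x86 IA32_MCG_STATUS register, that each "Bank N:" value is that bank's IA32_MCi_STATUS register, and that the bit positions follow the layout in the Machine-Check Architecture chapter of the Intel manual. The names `decode_flags`, `decode_mci_status` and `decode_log` are hypothetical.

```python
import re

# Assumed bit positions (Intel SDM, Machine-Check Architecture chapter).
MCG_STATUS_FLAGS = {"RIPV": 0, "EIPV": 1, "MCIP": 2}
MCI_STATUS_FLAGS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}

MCE_RE = re.compile(r"Machine Check Exception:\s*([0-9a-fA-F]+)")
BANK_RE = re.compile(r"Bank\s+(\d+):\s*([0-9a-fA-F]+)")


def decode_flags(value, layout):
    """Map each flag name to True/False according to its bit in value."""
    return {name: bool((value >> bit) & 1) for name, bit in layout.items()}


def decode_mci_status(value):
    """Split a bank status value into flags and the two error-code fields."""
    fields = decode_flags(value, MCI_STATUS_FLAGS)
    fields["mca_error_code"] = value & 0xFFFF             # bits 15:0
    fields["model_specific_code"] = (value >> 16) & 0xFFFF  # bits 31:16
    return fields


def decode_log(text):
    """Decode every MCE header and bank line found in a block of log text."""
    result = {"mcg_status": None, "banks": {}}
    for line in text.splitlines():
        header = MCE_RE.search(line)
        if header:
            result["mcg_status"] = decode_flags(int(header.group(1), 16), MCG_STATUS_FLAGS)
        bank = BANK_RE.search(line)
        if bank:
            result["banks"][int(bank.group(1))] = decode_mci_status(int(bank.group(2), 16))
    return result


EXAMPLE = """CPU 0: Machine Check Exception: 0000000000000004
Bank 2: f200200000000863
Kernel panic: CPU context corrupt"""

if __name__ == "__main__":
    decoded = decode_log(EXAMPLE)
    print(decoded["mcg_status"])
    print(decoded["banks"][2])
```

Under these assumptions, the example decodes to MCIP set in the global status, and to VAL, OVER, UC, EN and PCC set in bank 2 with an MCA error code of 0x0863. The PCC (processor context corrupt) flag is consistent with the "CPU context corrupt" panic message. Interpreting the error code itself requires the processor manual or a tool such as those listed under "Decoding MCEs".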
Problem types
Some of the main hardware problems that cause MCEs include:
* System bus errors: errors communicating between the processor and the motherboard.
* Memory errors: parity checking detects when a memory error has occurred. Error correction code (ECC) can correct limited memory errors so that processing can continue.
* CPU cache errors in the processor.
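The difference between parity checking (detection only) and ECC (detection plus limited correction) can be illustrated with a toy example. The Python sketch below uses a single even-parity bit and a Hamming(7,4) code; both are chosen here purely for illustration and are not a description of any particular memory controller, which would typically use a wider code over larger words. All function names are hypothetical.

```python
def parity_bit(bits):
    """Even-parity bit: 1 when the data contains an odd number of 1 bits."""
    return sum(bits) % 2


def parity_error(bits, stored_parity):
    """True if the data no longer matches its stored parity bit."""
    return parity_bit(bits) != stored_parity


def hamming74_encode(data):
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(code):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    data = [1, 0, 1, 1]

    # Parity: a single flipped bit is detected but cannot be located.
    stored = parity_bit(data)
    corrupted = [1, 0, 0, 1]
    print("parity error detected:", parity_error(corrupted, stored))

    # Hamming(7,4): a single flipped bit is located and corrected.
    word = hamming74_encode(data)
    word[4] ^= 1  # simulate a single-bit memory error
    print("recovered:", hamming74_correct(word))
```

With parity alone the only safe response to a mismatch is to report the error, whereas the corrected Hamming word lets processing continue, which is why a corrected single-bit error can be a non-fatal event.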
Possible causes
Machine checks usually indicate a hardware problem rather than a software problem. They are often the result of overclocking or overheating. In some cases, the CPU will shut itself off once it passes a thermal limit, to avoid permanent damage. Machine checks can also be caused by bus errors introduced by other failing components, such as memory or I/O devices. Possible causes include:
* Poor CPU cooling due to a CPU heatsink or case fans (or filters) that are clogged with dust or have come loose.
* Overclocking beyond the highest clock rate at which the CPU is still reliable.
* Failing motherboard.
* Failing processor.
* Failing memory.
* Failing I/O controllers, on either the motherboard or separate cards.
* Failing I/O devices.
* Inadequate or failing power supply.
Cooling problems are usually obvious upon inspection. A failing motherboard or processor can be identified by swapping it with a functioning part. Memory can be checked by booting to a diagnostic tool, such as memtest86. Non-essential failing I/O devices and controllers can be identified by unplugging them, if possible, or disabling them to see whether the problem disappears. If failures typically occur either fairly soon after the OS is booted or else not at all or not for days, a power supply issue may be the cause. With a power supply problem, the failure often occurs when power demand peaks as the OS starts up any external devices for use.
Decoding MCEs
For IA-32 and Intel 64 processors, consult the Intel 64 and IA-32 Architectures Software Developer's Manual Chapter 15 (Machine-Check Architecture), or the Microsoft KB Article on Windows Exceptions.
Programs to decode Intel and AMD MCEs
* rasdaemon is a RAS (reliability, availability and serviceability) logging tool for Linux. It records memory errors using the EDAC tracing events. EDAC is a Linux kernel subsystem that handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures, such as ARM, also exist. rasdaemon is the recommended way to gather MCE information on Linux systems because mcelog has been deprecated as of 2017.
* mcelog is a Linux daemon by Andi Kleen to handle MCEs for x86 processors. mcelog can also decode machine checks. It is considered functionally obsolete as of 2017; its replacement on Linux systems is rasdaemon.
* parsemce is a Linux program by Dave Jones to decode MCEs from AMD K7 processors.
* mced (mcedaemon) is a Linux program by Tim Hockin to gather MCEs from the kernel and alert interested applications. It does not try to interpret the MCE data; it simply alerts other programs.
* mcat is a Windows command-line program from AMD to decode MCEs from AMD K8, Family 0x10 and 0x11 processors.
See also
* Machine Check Architecture (MCA)
* High availability (HA)
* Reliability, availability and serviceability (RAS)
* Windows Hardware Error Architecture (WHEA)
* Blue screen of death
* Kernel panic
External links
* mcelog: Advanced hardware error handling for x86 Linux
* parsemce: Linux Machine check exception handler parser