Predictive failure analysis

Predictive failure analysis (PFA) refers to methods intended to predict the imminent failure of systems or components (software or hardware), and potentially to enable mechanisms that avoid or counteract failures, or to recommend maintenance before a failure occurs. For example, computer mechanisms can analyze trends in corrected errors to predict future failures of hardware or memory components and proactively enable mechanisms to avoid them. Predictive Failure Analysis was originally the name of a proprietary IBM technology for monitoring the likelihood of hard disk drives failing, although the term is now used generically for a variety of technologies for judging the imminent failure of CPUs, memory and I/O devices. See also first failure data capture.


Disks

IBM introduced the term PFA and its technology in 1992 with reference to its 0662-S1x drive (a 1052 MB Fast-Wide SCSI-2 disk that operated at 5400 rpm). The technology relies on measuring several key (mainly mechanical) parameters of the drive unit, for example the flying height of the heads. The drive firmware compares the measured parameters against predefined thresholds and evaluates the health status of the drive. If the drive appears likely to fail soon, the system sends a notification to the disk controller. The major drawbacks of the technology included:
* the binary result: the only status visible to the host was the presence or absence of a notification
* the unidirectional communication: the notification was sent by the drive firmware
The technology merged with IntelliSafe to form the Self-Monitoring, Analysis, and Reporting Technology (SMART).
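The threshold comparison described above can be illustrated with a minimal Python sketch. It is not IBM's firmware logic: the parameter names, the limits and the function `failure_predicted` are hypothetical and chosen only for illustration. The sketch also reproduces the binary nature of the result, since the caller learns only whether a notification would be raised.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Acceptable range for one measured drive parameter."""
    low: float
    high: float


# Hypothetical parameters and limits, not taken from any real drive.
THRESHOLDS = {
    "head_flying_height_nm": Threshold(low=40.0, high=80.0),
    "spin_up_time_ms": Threshold(low=0.0, high=9000.0),
}


def failure_predicted(measurements: dict[str, float]) -> bool:
    """Return True if any measured parameter lies outside its range.

    The result is deliberately a single boolean, mirroring the
    notification / no-notification status visible to the host.
    """
    for name, limit in THRESHOLDS.items():
        value = measurements.get(name)
        if value is None:
            continue  # parameter was not measured in this cycle
        if not (limit.low <= value <= limit.high):
            return True
    return False
```

A host-side caller would see only `True` or `False`, with no indication of which parameter triggered the warning or how close the others are to their limits.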


Processor and Memory

High counts of intermittent RAM errors corrected by ECC can be predictive of future DIMM failures, so automatic offlining of memory and CPU caches can be used to avoid future errors. For example, under the Linux operating system the mcelog daemon automatically removes from use memory pages showing excessive corrections, and removes from use processor cores showing excessive correctable cache errors.
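The idea of offlining a page after excessive corrections can be sketched as a per-page counter over a sliding time window. This is an illustrative sketch only and does not reproduce the mcelog implementation: the class `CorrectedErrorTracker`, its default limits and the bookkeeping are assumptions, and the sketch merely records the decision rather than asking the kernel to offline anything.

```python
from collections import defaultdict, deque


class CorrectedErrorTracker:
    """Track corrected errors per memory page in a sliding time window."""

    def __init__(self, max_errors: int = 10, window_seconds: float = 86400.0):
        # Hypothetical defaults: more than 10 corrections in 24 hours.
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # page address -> timestamps
        self.offlined = set()              # pages marked for removal

    def record(self, page: int, timestamp: float) -> bool:
        """Record one corrected error.

        Returns True if this error pushed the page over the limit,
        meaning the page should now be removed from use.
        """
        if page in self.offlined:
            return False
        events = self._events[page]
        events.append(timestamp)
        # Forget corrections that have fallen out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        if len(events) > self.max_errors:
            self.offlined.add(page)
            del self._events[page]
            return True
        return False
```

Using a time window rather than a lifetime total means that occasional, widely spaced corrections do not take a page out of use, while a burst of corrections on the same page does.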


Optical media

On optical media (CD, DVD and Blu-ray), failures caused by degradation of the media can be predicted, and media of low manufacturing quality can be detected, before data loss occurs by measuring the rate of correctable data errors using software such as QPxTool or Nero DiscSpeed. However, not all vendors and models of optical drives allow error scanning (see the list of devices supported by the disc quality scanning software QPxTool).
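As a rough illustration of how a scan of correctable error rates might be turned into a warning, the following Python sketch summarizes a list of per-interval error counts. It does not reflect the output format or logic of QPxTool or Nero DiscSpeed: the function `assess_scan` and both limits are hypothetical, and suitable limits depend on the disc format and the error type being measured.

```python
from statistics import mean


def assess_scan(errors_per_interval, mean_limit, peak_limit):
    """Summarize one error scan and flag a disc that looks at risk.

    errors_per_interval: correctable error counts, one per scan interval.
    mean_limit, peak_limit: caller-supplied limits (assumptions, not
    values from any standard).
    """
    if not errors_per_interval:
        raise ValueError("scan contains no samples")
    average = mean(errors_per_interval)
    peak = max(errors_per_interval)
    return {
        "mean": average,
        "peak": peak,
        "at_risk": average > mean_limit or peak > peak_limit,
    }
```

Repeating the scan periodically and comparing the summaries would show whether the error rate of a given disc is rising over time, which is the signal that data should be copied before errors become uncorrectable.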



See also


MCELog - Linux daemon for processing x86 machine checks for predictive failure analysis