HOME

TheInfoList



OR:

In
science Science is a systematic endeavor that Scientific method, builds and organizes knowledge in the form of Testability, testable explanations and predictions about the universe. Science may be as old as the human species, and some of the earli ...
and
engineering Engineering is the use of scientific method, scientific principles to design and build machines, structures, and other items, including bridges, tunnels, roads, vehicles, and buildings. The discipline of engineering encompasses a broad rang ...
, root cause analysis (RCA) is a method of
problem solving Problem solving is the process of achieving a goal by overcoming obstacles, a frequent part of most activities. Problems in need of solutions range from simple personal tasks (e.g. how to turn on an appliance) to complex issues in business an ...
used for identifying the root causes of faults or problems. It is widely used in
IT operations Data center management is the collection of tasks performed by those responsible for managing ongoing operation of a data center This includes ''Business service management'' and planning for the future. Historically, ''data center management'' wa ...
, manufacturing,
telecommunications Telecommunication is the transmission of information by various types of technologies over wire, radio, optical, or other electromagnetic systems. It has its origin in the desire of humans for communication over a distance greater than tha ...
, industrial process control,
accident analysis Accident analysis is carried out in order to determine the cause or causes of an accident (that can result in single or multiple outcomes) so as to prevent further accidents of a similar kind. It is part of ''accident investigation or incident in ...
(e.g., in
aviation Aviation includes the activities surrounding mechanical flight and the aircraft industry. ''Aircraft'' includes fixed-wing and rotary-wing types, morphable wings, wing-less lifting bodies, as well as lighter-than-air craft such as hot ...
,
rail transport Rail transport (also known as train transport) is a means of transport that transfers passengers and goods on wheeled vehicles running on rails, which are incorporated in Track (rail transport), tracks. In contrast to road transport, where the ...
, or
nuclear plant A nuclear power plant (NPP) is a thermal power station in which the heat source is a nuclear reactor. As is typical of thermal power stations, heat is used to generate steam that drives a steam turbine connected to a generator that produces elec ...
s),
medicine Medicine is the science and Praxis (process), practice of caring for a patient, managing the diagnosis, prognosis, Preventive medicine, prevention, therapy, treatment, Palliative care, palliation of their injury or disease, and Health promotion ...
(for
medical diagnosis Medical diagnosis (abbreviated Dx, Dx, or Ds) is the process of determining which disease or condition explains a person's symptoms and signs. It is most often referred to as a diagnosis with the medical context being implicit. The information r ...
),
healthcare industry The healthcare industry (also called the medical industry or health economy) is an aggregation and integration of sectors within the economic system that provides goods and services to treat patients with curative, preventive, rehabilitative, ...
(e.g., for
epidemiology Epidemiology is the study and analysis of the distribution (who, when, and where), patterns and determinants of health and disease conditions in a defined population. It is a cornerstone of public health, and shapes policy decisions and evide ...
), etc. Root cause analysis is a form of deductive inference since it requires an understanding of the underlying causal mechanisms of the potential root causes and the problem. RCA can be decomposed into four steps: * Identify and describe the problem clearly * Establish a timeline from the normal situation until the problem occurs * Distinguish between the root cause and other causal factors (e.g., using
event correlation Event correlation is a technique for making sense of a large number of events and pinpointing the few events that are really important in that mass of information. This is accomplished by looking for and analyzing relationships between events. Hi ...
) * Establish a
causal graph In statistics, econometrics, epidemiology, genetics and related disciplines, causal graphs (also known as path diagrams, causal Bayesian networks or DAGs) are probabilistic graphical models used to encode assumptions about the data-generating pro ...
between the root cause and the problem RCA generally serves as input to a remediation process whereby
corrective actions Corrective and preventive action (CAPA or simply corrective action) consists of improvements to an organization's processes taken to eliminate causes of non-conformities or other undesirable situations. It is usually a set of actions, laws or regu ...
are taken to prevent the problem from recurring. The name of this process varies from one application domain to another. According to
ISO/IEC 31010 ISO/IEC 31010 is a standard concerning risk management codified by The International Organization for Standardization and The International Electrotechnical Commission (IEC). The full name of the standard is ISO.IEC 31010:2009 – Risk management � ...
, RCA may include the techniques
Five whys Five whys (or 5 whys) is an iterative interrogative technique used to explore the cause-and-effect relationships underlying a particular problem. The primary goal of the technique is to determine the root cause of a defect or problem by repeati ...
,
Failure mode and effects analysis Failure mode and effects analysis (FMEA; often written with "failure modes" in plural) is the process of reviewing as many components, assemblies, and subsystems as possible to identify potential failure modes in a system and their causes and effe ...
(FMEA),
Fault tree analysis Fault tree analysis (FTA) is a type of failure analysis in which an undesired state of a system is examined. This analysis method is mainly used in safety engineering and reliability engineering to understand how systems can fail, to identify t ...
,
Ishikawa diagram Ishikawa diagrams (also called fishbone diagrams, herringbone diagrams, cause-and-effect diagrams) are causal diagrams created by Kaoru Ishikawa that show the potential causes of a specific event. Common uses of the Ishikawa diagram are product ...
, and
Pareto analysis Pareto analysis is a formal technique useful where many possible courses of action are competing for attention. In essence, the problem-solver estimates the benefit delivered by each action, then selects a number of the most effective actions tha ...
.


Definitions

There are essentially two ways of repairing faults and solving problems in science and engineering. Reactive management consists of reacting quickly after the problem occurs, by treating the symptoms. This type of management is implemented by reactive systems, self-adaptive systems,
self-organized systems Self-organization, also called spontaneous order in the social sciences, is a process where some form of overall order arises from local interactions between parts of an initially disordered system. The process can be spontaneous when suffic ...
, and
complex adaptive system A complex adaptive system is a system that is '' complex'' in that it is a dynamic network of interactions, but the behavior of the ensemble may not be predictable according to the behavior of the components. It is '' adaptive'' in that the indi ...
s. The goal here is to react quickly and alleviate the effects of the problem as soon as possible. Proactive management, conversely, consists of preventing problems from occurring. Many techniques can be used for this purpose, ranging from good practices in design to analyzing in detail problems that have already occurred and taking actions to make sure they never recur. Speed is not as important here as the
accuracy and precision Accuracy and precision are two measures of '' observational error''. ''Accuracy'' is how close a given set of measurements (observations or readings) are to their '' true value'', while ''precision'' is how close the measurements are to each ot ...
of the diagnosis. The focus is on addressing the real cause of the problem rather than its effects. Root cause analysis is often used in proactive management to identify the root cause of a problem, that is, the factor that was the leading cause. It is customary to refer to the "root cause" in singular form, but one or several factors may constitute the ''root cause(s)'' of the problem under study. A factor is considered the "root cause" of a problem if removing it prevents the problem from recurring. Conversely, a "causal factor" is a contributing action that affects an incident/event's outcome but is not the root cause. Although removing a causal factor can benefit an outcome, it does not prevent its recurrence with certainty.


Example

Imagine an investigation into a machine that stopped because it was overloaded and the fuse blew. Investigation shows that the machine was overloaded because it had a bearing that wasn't being sufficiently lubricated. The investigation proceeds further and finds that the automatic lubrication mechanism had a pump that was not pumping sufficiently, hence the lack of lubrication. Investigation of the pump shows that it has a worn shaft. Investigation of why the shaft was worn discovers that there isn't an adequate mechanism to prevent metal scrap getting into the pump. This enabled scrap to get into the pump, and damage it. The apparent root cause of the problem is that metal scrap can contaminate the lubrication system. Fixing this problem ought to prevent the whole sequence of events recurring. The ''real'' root cause could be a design issue if there is no filter to prevent the metal scrap getting into the system. Or if it has a filter that was blocked due to lack of routine inspection, then the ''real'' root cause is a maintenance issue. Compare this with an investigation that does not find the root cause: replacing the fuse, the bearing, or the lubrication pump will probably allow the machine to go back into operation for a while. But there is a risk that the problem will simply recur, until the root cause is dealt with. The above doesn't include cost/benefit analysis: does the cost of replacing one or more machines exceed the cost of downtime until the fuse is replaced? This situation is sometimes referred to as ''the cure being worse than the disease''. As an unrelated example of the conclusions that can be drawn in the absence of the cost/benefit analysis, consider the tradeoff between some claimed benefits of population decline: In the short term there will be fewer payers into pension/retirement systems; whereas halting the population will require higher taxes to cover the cost of building more schools. This can help explain the problem of the cure being worse than the disease. Costs to consider go beyond finances when considering the personnel who operate the machinery. Ultimately, the goal is to prevent downtime; but more so prevent catastrophic injuries. Prevention begins with being proactive.


Application domains

Root cause analysis is used in many application domains.


Manufacturing and industrial process control

The example above illustrates how RCA can be used in
manufacturing Manufacturing is the creation or production of goods with the help of equipment, labor, machines, tools, and chemical or biological processing or formulation. It is the essence of secondary sector of the economy. The term may refer to a ...
. RCA is also routinely used in industrial process control, e.g. to control the production of chemicals ( quality control). RCA is also used for
failure analysis Failure analysis is the process of collecting and analyzing data to determine the cause of a failure, often with the goal of determining corrective actions or liability. According to Bloch and Geitner, ”machinery failures reveal a reaction chain o ...
in
engineering Engineering is the use of scientific method, scientific principles to design and build machines, structures, and other items, including bridges, tunnels, roads, vehicles, and buildings. The discipline of engineering encompasses a broad rang ...
and
maintenance Maintenance may refer to: Biological science * Maintenance of an organism * Maintenance respiration Non-technical maintenance * Alimony, also called ''maintenance'' in British English * Champerty and maintenance, two related legal doctrine ...
.


IT and telecommunications

Root cause analysis is frequently used in IT and telecommunications to detect the root causes of serious problems. For example, in the
ITIL The Information Technology Infrastructure Library (ITIL) is a set of detailed practices for IT activities such as IT service management (ITSM) and IT asset management (ITAM) that focus on aligning IT services with the needs of business. ITIL d ...
service management framework, the goal of
incident management An incident is an event that could lead to loss of, or disruption to, an organization's operations, services or functions. Incident management (IcM) is a term describing the activities of an organization to identify, analyze, and correct hazards ...
is to resume a faulty IT service as soon as possible (reactive management), whereas
problem management Problem management is the process responsible for managing the lifecycle of all problems that happen or could happen in an IT service. The primary objectives of problem management are to prevent problems and resulting incidents from happening, to ...
deals with solving recurring problems for good by addressing their root causes (proactive management). Another example is the computer security incident management process, where root-cause analysis is often used to investigate security breaches. RCA is also used in conjunction with
business activity monitoring Business activity monitoring (BAM) is software that aids the monitoring of business activities which are implemented in computer systems. The term was originally coined by analysts at Gartner, Inc. and refers to the aggregation, analysis, and ...
and
complex event processing Event processing is a method of tracking and analyzing (processing) streams of information (data) about things that happen (events), and deriving a conclusion from them. Complex event processing, or CEP, consists of a set of concepts and techniques ...
to analyze faults in
business process A business process, business method or business function is a collection of related, structured activities or tasks by people or equipment in which a specific sequence produces a service or product (serves a particular business goal) for a parti ...
es. Its use in the IT industry cannot always be compared to its use in safety critical industries, since in normality the use of RCA in IT industry is ''not'' supported by pre-existing fault trees or other design specs. Instead a mixture of debugging, event based detection and monitoring systems (where the services are individually modelled) is normally supporting the analysis. Training and supporting tools like simulation or different in-depth runbooks for all expected scenarios do not exist, instead they are created after the fact based on issues seen as 'worthy'. As a result the analysis is often limited to those things that have monitoring/observation interfaces and not the actual planned/seen function with focus on verification of inputs and outputs. Hence, the saying "there is no root cause" has become common in the IT industry.


Health and safety

In the domains of
health Health, according to the World Health Organization, is "a state of complete physical, mental and social well-being and not merely the absence of disease and infirmity".World Health Organization. (2006)''Constitution of the World Health Organiza ...
and
safety Safety is the state of being "safe", the condition of being protected from harm or other danger. Safety can also refer to the control of recognized hazards in order to achieve an acceptable level of risk. Meanings There are two slightly di ...
, RCA is routinely used in
medicine Medicine is the science and Praxis (process), practice of caring for a patient, managing the diagnosis, prognosis, Preventive medicine, prevention, therapy, treatment, Palliative care, palliation of their injury or disease, and Health promotion ...
(diagnosis) and
epidemiology Epidemiology is the study and analysis of the distribution (who, when, and where), patterns and determinants of health and disease conditions in a defined population. It is a cornerstone of public health, and shapes policy decisions and evide ...
(e.g., to identify the source of an infectious disease), where causal inference methods often require both clinical and statistical expertise to make sense of the complexities of the processes. RCA is used in
environmental science Environmental science is an interdisciplinary academic field that integrates physics, biology, and geography (including ecology, chemistry, plant science, zoology, mineralogy, oceanography, limnology, soil science, geology and physical ...
(e.g., to analyze environmental disasters),
accident analysis Accident analysis is carried out in order to determine the cause or causes of an accident (that can result in single or multiple outcomes) so as to prevent further accidents of a similar kind. It is part of ''accident investigation or incident in ...
(aviation and rail industry), and
occupational safety and health Occupational safety and health (OSH), also commonly referred to as occupational health and safety (OHS), occupational health, or occupational safety, is a multidisciplinary field concerned with the safety, health, and welfare of people at ...
. In the manufacture of medical devices, pharmaceuticals, food, and dietary supplements, root cause analysis is a regulatory requirement.


Systems analysis

RCA is also used in
change management Change management (sometimes abbreviated as CM) is a collective term for all approaches to prepare, support, and help individuals, teams, and organizations in making organizational change. It includes methods that redirect or redefine the use o ...
, risk management, and
systems analysis Systems analysis is "the process of studying a procedure or business to identify its goal and purposes and create systems and procedures that will efficiently achieve them". Another view sees system analysis as a problem-solving technique tha ...
.


General principles

Despite the different approaches among the various schools of root cause analysis and the specifics of each application domain, RCA generally follows the same four steps: # Identification and description: Effective
problem statement A problem statement is a concise description of an issue to be addressed or a condition to be improved upon. It identifies the gap between the current (problem) state and desired (goal) state of a process or product. Focusing on the facts, the prob ...
s and event descriptions (as failures, for example) are helpful and usually required to ensure the execution of appropriate root cause analyses. # Chronology: RCA should establish a
sequence of events Time is the continued sequence of existence and events that occurs in an apparently irreversible succession from the past, through the present, into the future. It is a component quantity of various measurements used to sequence events, to co ...
or
timeline A timeline is a display of a list of events in chronological order. It is typically a graphic design showing a long bar labelled with dates paralleling it, and usually contemporaneous events. Timelines can use any suitable scale represen ...
for understanding the relationships between contributory (causal) factors, the root cause, and the problem under investigation. # Differentiation: By correlating this sequence of events with the nature, the magnitude, the location, and the timing of the problem, and possibly also with a library of previously analyzed problems, RCA should enable the investigator(s) to distinguish between the root cause, causal factors, and non-causal factors. One way to trace down root causes consists in using
hierarchical clustering In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into tw ...
and data-mining solutions (such as graph-theory-based data mining). Another consists in comparing the situation under investigation with past situations stored in case libraries, using
case-based reasoning In artificial intelligence and philosophy, case-based reasoning (CBR), broadly construed, is the process of solving new problems based on the solutions of similar past problems. In everyday life, an auto mechanic who fixes an engine by recal ...
tools. # Causal graphing: Finally, the investigator should be able to extract from the sequences of events a subsequence of key events that explain the problem, and convert it into a
causal graph In statistics, econometrics, epidemiology, genetics and related disciplines, causal graphs (also known as path diagrams, causal Bayesian networks or DAGs) are probabilistic graphical models used to encode assumptions about the data-generating pro ...
. To be effective, root cause analysis must be performed systematically. The process enables the chance to not miss any other important details. A team effort is typically required, and ideally all persons involved should arrive at the same conclusion. In aircraft accident analyses, for example, the conclusions of the investigation and the root causes that are identified must be backed up by documented evidence.See .


Transition to corrective actions

The goal of RCA is to identify the root cause of the problem with the intent to stop the problem from recurring or worsening. The next step is to trigger long-term corrective actions to address the root cause identified during RCA, and make sure that the problem does not resurface. Correcting a problem is not formally part of RCA, however; these are different steps in a problem-solving process known as
fault management In network management, fault management is the set of functions that detect, isolate, and correct malfunctions in a telecommunications network, compensate for environmental changes, and include maintaining and examining error logs, accepting and ...
in IT and telecommunications,
repair The technical meaning of maintenance involves functional checks, servicing, repairing or replacing of necessary devices, equipment, machinery, building infrastructure, and supporting utilities in industrial, business, and residential install ...
in engineering, remediation in aviation,
environmental remediation Environmental remediation deals with the removal of pollution or contaminants from environmental media such as soil, groundwater, sediment, or surface water. Remedial action is generally subject to an array of regulatory requirements, and may ...
in
ecology Ecology () is the study of the relationships between living organisms, including humans, and their physical environment. Ecology considers organisms at the individual, population, community, ecosystem, and biosphere level. Ecology overl ...
,
therapy A therapy or medical treatment (often abbreviated tx, Tx, or Tx) is the attempted remediation of a health problem, usually following a medical diagnosis. As a rule, each therapy has indications and contraindications. There are many differe ...
in
medicine Medicine is the science and Praxis (process), practice of caring for a patient, managing the diagnosis, prognosis, Preventive medicine, prevention, therapy, treatment, Palliative care, palliation of their injury or disease, and Health promotion ...
, etc.


Challenges

Without delving in the idiosyncrasies of specific problems, several general conditions can make RCA more difficult than it may appear at first sight. First, important information is often missing because it is generally not possible, in practice, to monitor everything and store all monitoring data for a long time. Second, gathering data and evidence, and classifying them along a timeline of events to the final problem, can be nontrivial. In telecommunications, for instance, distributed monitoring systems typically manage between a million and a billion events per day. Finding a few relevant events in such a mass of irrelevant events is asking to find the proverbial
needle in a haystack Needle may refer to: Crafting * Crochet needle, a tool for making loops in thread or yarn * Knitting needle, a tool for knitting, not as sharp as a sewing needle * Sewing needle, a long slender tool with a pointed tip * Trussing needle, a long s ...
. Third, there may be more than one root cause for a given problem, and this multiplicity can make the causal graph very difficult to establish. Fourth, causal graphs often have many levels, and root-cause analysis terminates at a level that is "root" to the eyes of the investigator. Looking again at the example above in industrial process control, a deeper investigation could reveal that the maintenance procedures at the plant included periodic inspection of the lubrication subsystem every two years, while the current lubrication subsystem vendor's product specified a 6-month period. Switching vendors may have been due to management's desire to save money, and a failure to consult with engineering staff on the implication of the change on maintenance procedures. Thus, while the "root cause" shown above may have prevented the quoted recurrence, it would not have prevented other perhaps more severe failures affecting other machines.


See also

* * * * * * * * * * * * * * * * *


Notes


References

* * * * * * * * *


External links


"Root Cause Analysis and Monitoring Tools: A Perfect Match" by Irene Carrasco

"Cause Mapping a visual explanation"

"Sologic Root Cause Analysis Method"

"Advanced Root Cause Analysis"
{{DEFAULTSORT:Root Cause Analysis Quality Problem solving