Mean time between failures (MTBF) is the predicted elapsed time between inherent failures of a mechanical or electronic system during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a system. The term is used for repairable systems, while mean time to failure (MTTF) denotes the expected time to failure for a non-repairable system. The definition of MTBF depends on the definition of what is considered a failure. For complex, repairable systems, failures are considered to be those out-of-design conditions which place the system out of service and into a state for repair. Failures that can be left or maintained in an unrepaired condition, and do not place the system out of service, are not considered failures under this definition. In addition, units that are taken down for routine scheduled maintenance or inventory control are not considered within the definition of failure. The higher the MTBF, the longer a system is likely to work before failing.

Overview

Mean time between failures (MTBF) describes the expected time between two failures for a repairable system. For example, three identical systems start functioning properly at time 0 and work until all of them have failed. The first system fails after 100 hours, the second after 120 hours and the third after 130 hours. The MTBF of the systems is the average of the three failure times, (100 + 120 + 130) / 3 ≈ 116.667 hours. If the systems were non-repairable, then their MTTF would be 116.667 hours.

In general, MTBF is the "up-time" between two failure states of a repairable system during operation. For each observation, the "up time" is the moment the system went up, and the "down time" is the later moment at which it went down. The difference ("down time" minus "up time") is the amount of time the system was operating between these two events. The MTBF of a component is the sum of the lengths of the operational periods divided by the number of observed failures:

\text{MTBF} = \frac{\sum{(\text{start of downtime} - \text{start of uptime})}}{\text{number of failures}}.

In a similar manner, mean down time (MDT) can be defined as:

\text{MDT} = \frac{\sum{(\text{start of uptime} - \text{start of downtime})}}{\text{number of failures}}.
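The two averages can be sketched in Python as follows. This is a minimal illustration, not a standard tool: the function name `mtbf_and_mdt` and the event log are hypothetical, and the sketch assumes that every failure in the log is followed by an observed restart, so that both sums are divided by the same number of failures.

```python
def mtbf_and_mdt(up_times, down_times):
    """Return (MTBF, MDT) from an alternating log of up and down moments.

    up_times   : moments at which the system started (or resumed) operating;
                 must contain one more entry than down_times
    down_times : moments at which the system failed
    The i-th operating period runs from up_times[i] to down_times[i];
    the i-th repair period runs from down_times[i] to up_times[i + 1].
    """
    failures = len(down_times)
    if failures == 0 or len(up_times) != failures + 1:
        raise ValueError("need n failures and n + 1 up times, with n >= 1")

    # Sum of (start of downtime - start of uptime) over all operating periods.
    total_uptime = sum(down - up for up, down in zip(up_times, down_times))
    # Sum of (start of uptime - start of downtime) over all repair periods.
    total_downtime = sum(up - down for down, up in zip(down_times, up_times[1:]))

    return total_uptime / failures, total_downtime / failures


# Hypothetical log in hours: operating periods of 100, 120 and 130 hours,
# separated by repairs of 4, 6 and 5 hours.
ups = [0, 104, 230, 365]
downs = [100, 224, 360]
mtbf, mdt = mtbf_and_mdt(ups, downs)
print(f"MTBF = {mtbf:.3f} h, MDT = {mdt:.3f} h")
```

The operating periods in this log were chosen to match the 100, 120 and 130 hour example, so the MTBF again comes out at about 116.667 hours.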

Calculation

MTBF is defined as the arithmetic mean value of the reliability function R(t), which can be expressed as the expected value of the density function f(t) of time until failure (Alessandro Birolini: ''Reliability Engineering: Theory and Practice''. Springer, Berlin 2013):

\text{MTBF} = \int_0^\infty R(t)\, dt = \int_0^\infty t f(t)\, dt

Any practically relevant calculation of MTBF or probabilistic failure prediction based on MTBF requires that the system is working within its "useful life period", which is characterized by a relatively constant failure rate (the middle part of the "bathtub curve") when only random failures are occurring. Assuming a constant failure rate \lambda results in a failure density function as follows:

f(t) = \lambda e^{-\lambda t},

which, in turn, simplifies the above-mentioned calculation of MTBF to the reciprocal of the failure rate of the system:

\text{MTBF} = \frac{1}{\lambda}.

The units used are typically hours or lifecycles. This critical relationship between a system's MTBF and its failure rate allows a simple conversion/calculation when one of the two quantities is known and an exponential distribution (constant failure rate, i.e., no systematic failures) can be assumed. The MTBF is the expected value, average or mean of the exponential distribution. Once the MTBF of a system is known, the probability that any one particular system will be operational at time equal to the MTBF can be estimated. Under the assumption of a constant failure rate, any one particular system will survive to its calculated MTBF with a probability of 36.8% (i.e., it will fail before then with a probability of 63.2%). The same applies to the MTTF of a system working within this time period.
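As a rough numerical check of these relationships, the following Python sketch integrates the reliability function R(t) = e^(-λt) with the trapezoidal rule and evaluates R at t = MTBF. The failure rate is a hypothetical value, and the truncation of the integral at 40/λ is an assumption made only to keep the example finite.

```python
import math


def reliability(t, failure_rate):
    """R(t) = exp(-lambda * t), valid for a constant failure rate."""
    return math.exp(-failure_rate * t)


def mtbf_by_integration(failure_rate, steps=200_000):
    """Approximate the integral of R(t) over [0, infinity) by the trapezoidal rule.

    The upper limit is truncated at 40 / lambda, where R(t) is about e^-40
    and the neglected tail is far below the discretisation error.
    """
    upper = 40.0 / failure_rate
    h = upper / steps
    total = 0.5 * (reliability(0.0, failure_rate) + reliability(upper, failure_rate))
    for i in range(1, steps):
        total += reliability(i * h, failure_rate)
    return total * h


failure_rate = 1.0 / 5000.0              # hypothetical: one failure per 5000 hours
mtbf_closed_form = 1.0 / failure_rate    # reciprocal of the failure rate
mtbf_numeric = mtbf_by_integration(failure_rate)
survival_at_mtbf = reliability(mtbf_closed_form, failure_rate)

print(f"closed form: {mtbf_closed_form:.1f} h, numeric: {mtbf_numeric:.3f} h")
print(f"R(MTBF) = {survival_at_mtbf:.3f}")
```

The numeric integral agrees with 1/λ to within the integration error, and R(MTBF) equals e^(-1), which is the 36.8% survival probability quoted above.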

Application

The MTBF value can be used as a system reliability parameter or to compare different systems or designs. This value should only be understood conditionally as the “mean lifetime” (an average value), and not as a quantitative identity between working and failed units. Since MTBF can be expressed as “average life (expectancy)”, many engineers assume that 50% of items will have failed by time ''t'' = MTBF. This inaccuracy can lead to bad design decisions. Furthermore, probabilistic failure prediction based on MTBF implies the total absence of systematic failures (i.e., a constant failure rate with only intrinsic, random failures), which is not easy to verify. Assuming no systematic errors, the probability that the system survives a duration T is exp(-T/MTBF). Hence the probability that the system fails during a duration T is 1 - exp(-T/MTBF).

MTBF value prediction is an important element in the development of products. Reliability engineers and design engineers often use reliability software to calculate a product's MTBF according to various methods and standards (MIL-HDBK-217F, Telcordia SR332, Siemens SN 29500, FIDES, UTE 80-810 (RDF2000), etc.). The MIL-HDBK-217 reliability calculator manual in combination with RelCalc software (or another comparable tool) enables MTBF reliability rates to be predicted based on design.

A concept which is closely related to MTBF, and is important in the computations involving MTBF, is the mean down time (MDT). MDT can be defined as the mean time for which the system is down after a failure. Usually, MDT is considered different from MTTR (mean time to repair); in particular, MDT usually includes organizational and logistical factors (such as business days or waiting for components to arrive), while MTTR is usually understood as narrower and more technical.
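The survival and failure probabilities for a duration T given earlier in this section can be sketched in Python as below. The MTBF value and the function names are hypothetical, and a constant failure rate is assumed throughout. The sketch also computes the time by which half of the units are expected to have failed, which under this assumption is MTBF · ln 2 rather than the MTBF itself.

```python
import math


def survival_probability(duration, mtbf):
    """Probability of surviving `duration`, assuming a constant failure rate."""
    return math.exp(-duration / mtbf)


def failure_probability(duration, mtbf):
    """Probability of failing within `duration`, assuming a constant failure rate."""
    return 1.0 - survival_probability(duration, mtbf)


def median_life(mtbf):
    """Time by which 50% of units are expected to have failed: MTBF * ln(2)."""
    return mtbf * math.log(2.0)


mtbf = 10_000.0                       # hypothetical MTBF in hours
p_fail_at_mtbf = failure_probability(mtbf, mtbf)
t_half = median_life(mtbf)

print(f"P(failure by t = MTBF) = {p_fail_at_mtbf:.3f}")
print(f"50% of units have failed after about {t_half:.0f} h")
```

With these numbers, half of the units have failed after roughly 0.69 × MTBF, while about 63.2% have failed by t = MTBF, which illustrates why reading MTBF as a 50% point is misleading.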

MTBF and MDT for networks of components

Two components c_1, c_2 (for instance hard drives, servers, etc.) may be arranged in a network, in ''series'' or in ''parallel''. The terminology is used here by close analogy to electrical circuits, but has a slightly different meaning. We say that the two components are in series if the failure of ''either'' causes the failure of the network, and that they are in parallel if only the failure of ''both'' causes the network to fail. The MTBF of the resulting two-component network with repairable components can be computed according to the following formulae, assuming that the MTBF of both individual components is known:

\text{MTBF}(c_1 ; c_2) = \frac{1}{\frac{1}{\text{MTBF}(c_1)} + \frac{1}{\text{MTBF}(c_2)}} = \frac{\text{MTBF}(c_1) \times \text{MTBF}(c_2)}{\text{MTBF}(c_1) + \text{MTBF}(c_2)} \;,

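where c_1 ; c_2 is the network in which the components are arranged in series. For a network containing parallel repairable components, finding the MTBF of the whole system requires the components' respective MDTs in addition to their MTBFs. Then, assuming that MDTs are negligible compared to MTBFs (which usually holds in practice), the MTBF for the parallel system consisting of two parallel repairable components can be written as follows:

\begin{align}
\text{MTBF}(c_1 \parallel c_2) &= \frac{1}{\frac{1}{\text{MTBF}(c_1)} \times PF(c_2, \text{MDT}(c_1)) + \frac{1}{\text{MTBF}(c_2)} \times PF(c_1, \text{MDT}(c_2))} \\
&= \frac{1}{\frac{1}{\text{MTBF}(c_1)} \times \frac{\text{MDT}(c_1)}{\text{MTBF}(c_2)} + \frac{1}{\text{MTBF}(c_2)} \times \frac{\text{MDT}(c_2)}{\text{MTBF}(c_1)}} \\
&= \frac{\text{MTBF}(c_1) \times \text{MTBF}(c_2)}{\text{MDT}(c_1) + \text{MDT}(c_2)} \;,
\end{align}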

where c_1 \parallel c_2 is the network in which the components are arranged in parallel, and PF(c,t) is the probability of failure of component c during the "vulnerability window" t.

Intuitively, both these formulae can be explained from the point of view of failure probabilities. First of all, note that the failure rate of a system, i.e. its probability of failing per unit time over a short timeframe, is the inverse of its MTBF. Then, when considering components in series, the failure of any component leads to the failure of the whole system, so (assuming that failure probabilities are small, which is usually the case) the probability of failure of the whole system within a given interval can be approximated as the sum of the failure probabilities of the components. With parallel components the situation is a bit more complicated: the whole system will fail if and only if, after one of the components fails, the other component fails while the first component is being repaired. This is where MDT comes into play: the faster the first component is repaired, the smaller the "vulnerability window" in which the other component can fail.

Using similar logic, the MDT for a system of two serial components can be calculated as:

\text{MDT}(c_1 ; c_2) = \frac{\text{MTBF}(c_1) \times \text{MDT}(c_2) + \text{MTBF}(c_2) \times \text{MDT}(c_1)}{\text{MTBF}(c_1) + \text{MTBF}(c_2)} \;,

and for a system of two parallel components the MDT can be calculated as:

\text{MDT}(c_1 \parallel c_2) = \frac{\text{MDT}(c_1) \times \text{MDT}(c_2)}{\text{MDT}(c_1) + \text{MDT}(c_2)} \;.

Through successive application of these four formulae, the MTBF and MDT of any network of repairable components can be computed, provided that the MTBF and MDT are known for each component. In the special but all-important case of several serial components, the MTBF calculation can be easily generalised into

\text{MTBF}(c_1;\dots; c_n) = \left(\sum_{k=1}^n \frac 1{\text{MTBF}(c_k)}\right)^{-1}\;,

which can be shown by induction, and likewise

\text{MDT}(c_1\parallel\dots\parallel c_n) = \left(\sum_{k=1}^n \frac 1{\text{MDT}(c_k)}\right)^{-1}\;,

since the formula for the MDT of two components in parallel is identical to that of the MTBF for two components in series.
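The successive application of the series and parallel formulae can be sketched in Python as follows. The sketch assumes the standard closed forms MTBF(c_1 ∥ c_2) = MTBF(c_1)·MTBF(c_2) / (MDT(c_1) + MDT(c_2)) and MDT(c_1 ; c_2) = (MTBF(c_1)·MDT(c_2) + MTBF(c_2)·MDT(c_1)) / (MTBF(c_1) + MTBF(c_2)), with MDTs negligible compared to MTBFs. The `Component` type, the function names and all numbers are hypothetical.

```python
from typing import NamedTuple


class Component(NamedTuple):
    mtbf: float  # mean time between failures, in hours
    mdt: float   # mean down time, in hours


def series(a: Component, b: Component) -> Component:
    """Network that fails when either component fails."""
    mtbf = 1.0 / (1.0 / a.mtbf + 1.0 / b.mtbf)
    mdt = (a.mtbf * b.mdt + b.mtbf * a.mdt) / (a.mtbf + b.mtbf)
    return Component(mtbf, mdt)


def parallel(a: Component, b: Component) -> Component:
    """Network that fails only when both components are down at once.

    Assumes each MDT is negligible compared to the MTBFs.
    """
    mtbf = a.mtbf * b.mtbf / (a.mdt + b.mdt)
    mdt = a.mdt * b.mdt / (a.mdt + b.mdt)
    return Component(mtbf, mdt)


def mtbf_series_n(components) -> float:
    """Generalised series formula: inverse of the sum of inverse MTBFs."""
    return 1.0 / sum(1.0 / c.mtbf for c in components)


# Hypothetical example: two mirrored disks behind a single controller.
disk = Component(mtbf=100_000.0, mdt=24.0)
controller = Component(mtbf=500_000.0, mdt=8.0)
mirrored = parallel(disk, disk)
storage = series(mirrored, controller)
print(f"mirror:  MTBF = {mirrored.mtbf:.0f} h, MDT = {mirrored.mdt:.1f} h")
print(f"storage: MTBF = {storage.mtbf:.0f} h, MDT = {storage.mdt:.2f} h")
```

Because each function returns a `Component`, a sub-network can be treated as a single component in the next step, which is what makes the repeated reduction of a larger network possible.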

Variations of MTBF

There are many variations of MTBF, such as ''mean time between system aborts'' (MTBSA), ''mean time between critical failures'' (MTBCF) or ''mean time between unscheduled removal'' (MTBUR). Such nomenclature is used when it is desirable to differentiate among types of failures, such as critical and non-critical failures. For example, in an automobile, the failure of the FM radio does not prevent the primary operation of the vehicle. It is recommended to use ''Mean time to failure'' (MTTF) instead of MTBF in cases where a system is replaced after a failure ("non-repairable system"), since MTBF denotes time between failures in a system which can be repaired.

MTTFd (mean time to dangerous failure) is an extension of MTTF, and is only concerned with failures which would result in a dangerous condition. It can be calculated as follows:

\begin{align}
\text{MTTF} & \approx \frac{B_{10}}{0.1 n_\text{op}}, \\
\text{MTTFd} & \approx \frac{B_{10d}}{0.1 n_\text{op}},
\end{align}

where ''B''_10 is the number of operations that a device will perform before 10% of a sample of those devices would fail, and ''n''_op is the number of operations (cycles) in one year. ''B''_10d is the same quantity, but where 10% of the sample would fail to danger.
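A small Python sketch of this calculation is given below. It assumes the relation MTTFd ≈ B_10d / (0.1 · n_op), with n_op counted per year so that the result is in years. The function name and the component values are hypothetical and not taken from any data sheet.

```python
def mttf_from_b10(b10, n_op_per_year):
    """Approximate MTTF (or MTTFd, when given B10d) in years.

    b10           : operations until 10% of a sample has failed
                    (pass B10d to count only dangerous failures)
    n_op_per_year : number of operations per year
    """
    if n_op_per_year <= 0:
        raise ValueError("n_op_per_year must be positive")
    return b10 / (0.1 * n_op_per_year)


# Hypothetical component: B10d of 2,000,000 cycles, operated 50,000 times a year.
mttfd_years = mttf_from_b10(2_000_000, 50_000)
print(f"MTTFd is roughly {mttfd_years:.0f} years")
```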

MTBF considering censoring

An MTBF computed by counting only failures, while at least some systems are still operating and have not yet failed, underestimates the true MTBF, because the computation omits the partial lifetimes of the systems that have not yet failed. For such lifetimes, all we know is that the time to failure exceeds the time they have been running. This is called censoring. With a parametric model of the lifetime, the likelihood for the experience on any given day is as follows:

L = \prod_i \lambda(u_i)^{\delta_i} S(u_i)

where u_i is the failure time for failures and the censoring time for units that have not yet failed, \delta_i = 1 for failures and 0 for censoring times, S(u_i) is the probability that the lifetime exceeds u_i, called the survival function, and \lambda(u) = f(u)/S(u) is called the hazard function, the instantaneous force of mortality (where f(u) is the probability density function of the distribution).

For an exponential distribution, the hazard, \lambda, is constant. In this case, the MTBF is

\text{MTBF} = 1 / \hat\lambda = \sum u_i / k,

where \hat\lambda is the maximum likelihood estimate of \lambda, maximizing the likelihood given above, and k = \sum \delta_i is the number of uncensored observations. We see that the difference between the MTBF considering only failures and the MTBF including censored observations is that the censoring times add to the numerator but not the denominator in computing the MTBF.
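The effect of censoring on the estimate can be sketched in Python as follows, assuming an exponential lifetime model. The function names and the observation data are hypothetical: three units have failed and two are still running after 150 hours.

```python
def mtbf_with_censoring(times, failed):
    """Maximum likelihood MTBF for exponential lifetimes: sum(u_i) / k.

    times  : u_i, the failure time for failed units and the running time
             so far for units that have not yet failed
    failed : delta_i, True for an observed failure, False for a censored unit
    """
    k = sum(1 for f in failed if f)      # number of uncensored observations
    if k == 0:
        raise ValueError("at least one observed failure is required")
    return sum(times) / k


def mtbf_failures_only(times, failed):
    """Naive estimate that discards units which have not yet failed."""
    observed = [t for t, f in zip(times, failed) if f]
    if not observed:
        raise ValueError("at least one observed failure is required")
    return sum(observed) / len(observed)


# Hypothetical test: three failures and two units still running at 150 hours.
times = [100.0, 120.0, 130.0, 150.0, 150.0]
failed = [True, True, True, False, False]
naive = mtbf_failures_only(times, failed)
censored = mtbf_with_censoring(times, failed)
print(f"failures only: {naive:.3f} h, with censoring: {censored:.3f} h")
```

The censored running times enter only the numerator, so the estimate that accounts for censoring is larger than the failures-only average.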
