High Availability Computing
   HOME

TheInfoList



OR:

High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually
uptime Uptime is a Measurement, measure of system reliability, expressed as the period of system time, time a machine, typically a computer, has been continuously working and available. Uptime is the opposite of downtime. It is often used as a measure ...
, for a higher than normal period. There is now more dependence on these systems as a result of modernization. For example, to carry out their regular daily tasks, hospitals and data centers need their systems to be highly available.
Availability In reliability engineering, the term availability has the following meanings: * The degree to which a system, subsystem or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at ...
refers to the ability of the user to access a service or system, whether to submit new work, update or modify existing work, or retrieve the results of previous work. If a user cannot access the system, it is considered ''unavailable from the user's perspective''. The term ''
downtime In computing and telecommunications, downtime (also (system) outage or (system) drought colloquially) is a period when a system is unavailable. The unavailability is the proportion of a time-span that a system is unavailable or offline. This is ...
'' is generally used to refer to describe periods when a system is unavailable.


Resilience

High availability is a property of
network Network, networking and networked may refer to: Science and technology * Network theory, the study of graphs as a representation of relations between discrete objects * Network science, an academic field that studies complex networks Mathematics ...
resilience, the ability to "provide and maintain an acceptable level of service in the face of faults and challenges to normal operation." Threats and challenges for services can range from simple misconfiguration over large scale natural disasters to targeted attacks. As such, network resilience touches a very wide range of topics. In order to increase the resilience of a given communication network, the probable challenges and risks have to be identified and appropriate resilience metrics have to be defined for the service to be protected. The importance of network resilience is continuously increasing, as communication networks are becoming a fundamental component in the operation of critical infrastructures. Consequently, recent efforts focus on interpreting and improving network and computing resilience with applications to critical infrastructures. As an example, one can consider as a resilience objective the provisioning of services over the network, instead of the services of the network itself. This may require coordinated response from both the network and from the services running on top of the network. These services include: * supporting
distributed processing Distributed computing is a field of computer science that studies distributed systems, defined as computer systems whose inter-communicating components are located on different networked computers. The components of a distributed system commun ...
* supporting network storage * maintaining service of communication services such as **
video conferencing Videotelephony (also known as videoconferencing or video calling) is the use of audio signal, audio and video for simultaneous two-way communication. Today, videotelephony is widespread. There are many terms to refer to videotelephony. ''Vide ...
**
instant messaging Instant messaging (IM) technology is a type of synchronous computer-mediated communication involving the immediate ( real-time) transmission of messages between two or more parties over the Internet or another computer network. Originally involv ...
** online collaboration * access to applications and data as needed Resilience and
survivability Survivability is the ability to remain alive or continue to exist. The term has more specific meaning in certain contexts. Ecological Following disruptive forces such as flood, fire, disease, war, or climate change some species of flora, faun ...
are interchangeably used according to the specific context of a given study.


Principles

There are three principles of
systems design The basic study of system design is the understanding of component parts and their subsequent interaction with one another. Systems design has appeared in a variety of fields, including sustainability, computer/software architecture, and sociolog ...
in
reliability engineering Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure. Reliability is defined as the probability that a product, system, or service will perform its intended functi ...
that can help achieve high availability. # Elimination of single points of failure. This means adding or building redundancy into the system so that failure of a component does not mean failure of the entire system. # Reliable crossover. In redundant systems, the crossover point itself tends to become a single point of failure. Reliable systems must provide for reliable crossover. # Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure – but the maintenance activity must.


Scheduled and unscheduled downtime

A distinction can be made between scheduled and unscheduled downtime. Typically,
scheduled downtime In computing and telecommunications, downtime (also (system) outage or (system) drought colloquially) is a period when a system is unavailable. The unavailability is the proportion of a time-span that a system is unavailable or offline. This is u ...
is a result of
maintenance The technical meaning of maintenance involves functional checks, servicing, repairing or replacing of necessary devices, equipment, machinery, building infrastructure and supporting utilities in industrial, business, and residential installa ...
that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Scheduled downtime events might include patches to
system software System software is software designed to provide a platform for other software. An example of system software is an operating system (OS) (like macOS, Linux, Android, and Microsoft Windows). Application software is software that allows users to d ...
that require a
reboot In computing, rebooting is the process by which a running computer system is restarted, either intentionally or unintentionally. Reboots can be either a cold reboot (alternatively known as a hard reboot) in which the power to the system is physi ...
or system configuration changes that only take effect upon a reboot. In general, scheduled downtime is usually the result of some logical, management-initiated event. Unscheduled downtime events typically arise from some physical event, such as a hardware or software failure or environmental anomaly. Examples of unscheduled downtime events include power outages, failed CPU or
RAM Ram, ram, or RAM most commonly refers to: * A male sheep * Random-access memory, computer memory * Ram Trucks, US, since 2009 ** List of vehicles named Dodge Ram, trucks and vans ** Ram Pickup, produced by Ram Trucks Ram, ram, or RAM may also ref ...
components (or possibly other failed hardware components), an over-temperature related shutdown, logically or physically severed network connections, security breaches, or various application,
middleware Middleware is a type of computer software program that provides services to software applications beyond those available from the operating system. It can be described as "software glue". Middleware makes it easier for software developers to imple ...
, and
operating system An operating system (OS) is system software that manages computer hardware and software resources, and provides common daemon (computing), services for computer programs. Time-sharing operating systems scheduler (computing), schedule tasks for ...
failures. If users can be warned away from scheduled downtimes, then the distinction is useful. But if the requirement is for true high availability, then downtime is downtime whether or not it is scheduled. Many computing sites exclude scheduled downtime from availability calculations, assuming that it has little or no impact upon the computing user community. By doing this, they can claim to have phenomenally high availability, which might give the illusion of
continuous availability Continuous availability is an approach to computer system and application design that protects users against downtime, whatever the cause and ensures that users remain connected to their documents, data files and business applications. Continuous av ...
. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any
single point of failure A single point of failure (SPOF) is a part of a system that would Cascading failure, stop the entire system from working if it were to fail. The term single point of failure implies that there is not a backup or redundant option that would enab ...
and allow online hardware, network, operating system,
middleware Middleware is a type of computer software program that provides services to software applications beyond those available from the operating system. It can be described as "software glue". Middleware makes it easier for software developers to imple ...
, and application upgrades, patches, and replacements. For certain systems, scheduled downtime does not matter, for example, system downtime at an office building after everybody has gone home for the night.


Percentage calculation

Availability is usually expressed as a percentage of uptime in a given year. The following table shows the downtime that will be allowed for a particular percentage of availability, presuming that the system is required to operate continuously.
Service level agreement A service-level agreement (SLA) is an agreement between a service provider and a customer. Particular aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user. T ...
s often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles. The following table shows the translation from a given availability percentage to the corresponding amount of time a system would be unavailable.
The terms
uptime Uptime is a Measurement, measure of system reliability, expressed as the period of system time, time a machine, typically a computer, has been continuously working and available. Uptime is the opposite of downtime. It is often used as a measure ...
and
availability In reliability engineering, the term availability has the following meanings: * The degree to which a system, subsystem or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at ...
are often used interchangeably but do not always refer to the same thing. For example, a system can be "up" with its services not "available" in the case of a
network outage In computing and telecommunications, downtime (also (system) outage or (system) drought colloquially) is a period when a system is unavailable. The unavailability is the proportion of a time-span that a system is unavailable or offline. This is u ...
. Or a system undergoing software maintenance can be "available" to be worked on by a
system administrator An IT administrator, system administrator, sysadmin, or admin is a person who is responsible for the upkeep, configuration, and reliable operation of computer systems, especially multi-user computers, such as Server (computing), servers. The ...
, but its services do not appear "up" to the
end user In product development, an end user (sometimes end-user) is a person who ultimately uses or is intended to ultimately use a product. The end user stands in contrast to users who support or maintain the product, such as sysops, system administrato ...
or customer. The subject of the terms is thus important here: whether the focus of a discussion is the server hardware, server OS, functional service, software service/process, or similar, it is only if there is a single, consistent subject of the discussion that the words uptime and availability can be used synonymously.


Five-by-five mnemonic

A simple mnemonic rule states that ''5 nines'' allows approximately 5 minutes of downtime per year. Variants can be derived by multiplying or dividing by 10: 4 nines is 50 minutes and 3 nines is 500 minutes. In the opposite direction, 6 nines is 0.5 minutes (30 sec) and 7 nines is 3 seconds.


"Powers of 10" trick

Another memory trick to calculate the allowed downtime duration for an "n-nines" availability percentage is to use the formula 8.64 \times 10^ seconds per day. For example, 90% ("one nine") yields the exponent 4 - 1 = 3, and therefore the allowed downtime is 8.64 \times 10^3 seconds per day. Also, 99.999% ("five nines") gives the exponent 4 - 5 = -1, and therefore the allowed downtime is 8.64 \times 10^ seconds per day.


"Nines"

Percentages of a particular order of magnitude are sometimes referred to by the number of nines or "class of nines" in the digits. For example, electricity that is delivered without interruptions ( blackouts, brownouts or surges) 99.999% of the time would have 5 nines reliability, or class five. In particular, the term is used in connection with
mainframes A mainframe computer, informally called a mainframe or big iron, is a computer used primarily by large organizations for critical applications like bulk data processing for tasks such as censuses, industry and consumer statistics, enterprise ...
or enterprise computing, often as part of a
service-level agreement A service-level agreement (SLA) is an agreement between a service provider and a customer. Particular aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user. T ...
. Similarly, percentages ending in a 5 have conventional names, traditionally the number of nines, then "five", so 99.95% is "three nines five", abbreviated 3N5. This is casually referred to as "three and a half nines", but this is incorrect: a 5 is only a factor of 2, while a 9 is a factor of 10, so a 5 is 0.3 nines (per below formula: \log_ 2 \approx 0.3): 99.95% availability is 3.3 nines, not 3.5 nines. More simply, going from 99.9% availability to 99.95% availability is a factor of 2 (0.1% to 0.05% unavailability), but going from 99.95% to 99.99% availability is a factor of 5 (0.05% to 0.01% unavailability), over twice as much. A formulation of the ''class of 9s'' c based on a system's unavailability x would be : c := \lfloor - \log_ x \rfloor (cf.
Floor and ceiling functions In mathematics, the floor function is the function that takes as input a real number , and gives as output the greatest integer less than or equal to , denoted or . Similarly, the ceiling function maps to the least integer greater than or eq ...
). A similar measurement is sometimes used to describe the purity of substances. In general, the number of nines is not often used by a network engineer when modeling and measuring availability because it is hard to apply in formula. More often, the unavailability expressed as a
probability Probability is a branch of mathematics and statistics concerning events and numerical descriptions of how likely they are to occur. The probability of an event is a number between 0 and 1; the larger the probability, the more likely an e ...
(like 0.00001), or a
downtime In computing and telecommunications, downtime (also (system) outage or (system) drought colloquially) is a period when a system is unavailable. The unavailability is the proportion of a time-span that a system is unavailable or offline. This is ...
per year is quoted. Availability specified as a number of nines is often seen in
marketing Marketing is the act of acquiring, satisfying and retaining customers. It is one of the primary components of Business administration, business management and commerce. Marketing is usually conducted by the seller, typically a retailer or ma ...
documents. The use of the "nines" has been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence. For large amounts of 9s, the "unavailability" index (measure of downtime rather than uptime) is easier to handle. For example, this is why an "unavailability" rather than availability metric is used in hard disk or data link
bit error rate In digital transmission, the number of bit errors is the number of received bits of a data stream over a communication channel that have been altered due to noise, interference, distortion or bit synchronization errors. The bit error rate (BER) ...
s. Sometimes the humorous term "nine fives" (55.5555555%) is used to contrast with "five nines" (99.999%), though this is not an actual goal, but rather a sarcastic reference to something totally failing to meet any reasonable target.


Measurement and interpretation

Availability measurement is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might have been eclipsed by a network failure that lasted for 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% uptime. However, given the true definition of availability, the system will be approximately 99.9% available, or three nines (8751 hours of available time out of 8760 hours per non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, even when the systems are continuing to function. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users – a true availability measure is holistic. Availability must be measured to be determined, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. If there is a lack of instrumentation, systems supporting high volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems which experience periodic lulls in demand. An alternative metric is mean time between failures (MTBF).


Closely related concepts

Recovery time (or estimated time of repair (ETR), also known as recovery time objective (RTO) is closely related to availability, that is the total time required for a planned outage or the time required to fully recover from an unplanned outage. Another metric is
mean time to recovery Mean time to recovery (MTTR) is the average time that a device will take to recover from any failure. Examples of such devices range from self-resetting fuses (where the MTTR would be very short, probably seconds), to whole systems which have t ...
(MTTR). Recovery time could be infinite with certain system designs and failures, i.e. full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary
disaster recovery IT disaster recovery (also, simply disaster recovery (DR)) is the process of maintaining or reestablishing vital infrastructure and systems following a natural or human-induced disaster, such as a storm or battle. DR employs policies, tools, ...
data center. Another related concept is data availability, that is the degree to which
database In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and a ...
s and other information storage systems faithfully record and report system transactions. Information management often focuses separately on data availability, or Recovery Point Objective, in order to determine acceptable (or actual)
data loss Data loss is an error condition in information systems in which information is destroyed by failures (like failed spindle motors or head crashes on hard drives) or neglect (like mishandling, careless handling or storage under unsuitable conditions) ...
with various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss. A
service level agreement A service-level agreement (SLA) is an agreement between a service provider and a customer. Particular aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user. T ...
("SLA") formalizes an organization's availability objectives and requirements.


Military control systems

High availability is one of the primary requirements of the
control system A control system manages, commands, directs, or regulates the behavior of other devices or systems using control loops. It can range from a single home heating controller using a thermostat controlling a domestic boiler to large industrial ...
s in unmanned vehicles and autonomous maritime vessels. If the controlling system becomes unavailable, the
Ground Combat Vehicle The Ground Combat Vehicle (GCV) was a program initiated by the United States Army in 2009, with the goal of developing a next-generation armored fighting vehicle. The first variant of the GCV to be developed would be an infantry fighting vehicle ...
(GCV) or ASW Continuous Trail Unmanned Vessel (ACTUV) would be lost.


System design

On one hand, adding more components to an overall system design can undermine efforts to achieve high availability because
complex system A complex system is a system composed of many components that may interact with one another. Examples of complex systems are Earth's global climate, organisms, the human brain, infrastructure such as power grid, transportation or communication sy ...
s inherently have more potential failure points and are more difficult to implement correctly. While some analysts would put forth the theory that the most highly available systems adhere to a simple architecture (a single, high-quality, multi-purpose physical system with comprehensive internal hardware redundancy), this architecture suffers from the requirement that the entire system must be brought down for patching and operating system upgrades. More advanced system designs allow for systems to be patched and upgraded without compromising service availability (see load balancing and
failover Failover is switching to a redundant or standby computer server, system, hardware component or network upon the failure or abnormal termination of the previously active application, server, system, hardware component, or network in a computer ...
). High availability requires less human intervention to restore operation in complex systems; the reason for this being that the most common cause for outages is human error.


High availability through redundancy

On the other hand, redundancy is used to create systems with high levels of availability (e.g. popular ecommerce websites). In this case it is required to have high levels of failure detectability and avoidance of common cause failures. If redundant parts are used in parallel and have independent failure (e.g. by not being within the same data center), they can exponentially increase the availability and make the overall system highly available. If you have N parallel components each having X availability, then you can use following formula: Availability of parallel components = 1 - (1 - X)^ N So for example if each of your components has only 50% availability, by using 10 of components in parallel, you can achieve 99.9023% availability. Two kinds of redundancy are passive redundancy and active redundancy. Passive redundancy is used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline. The simplest example is a boat with two separate engines driving two separate propellers. The boat continues toward its destination despite failure of a single engine or propeller. A more complex example is multiple redundant power generation facilities within a large system involving
electric power transmission Electric power transmission is the bulk movement of electrical energy from a generating site, such as a power plant, to an electrical substation. The interconnected lines that facilitate this movement form a ''transmission network''. This is ...
. Malfunction of single components is not considered to be a failure unless the resulting performance decline exceeds the specification limits for the entire system. Active redundancy is used in complex systems to achieve high availability with no performance decline. Multiple items of the same kind are incorporated into a design that includes a method to detect failure and automatically reconfigure the system to bypass failed items using a voting scheme. This is used with complex computing systems that are linked. Internet
routing Routing is the process of selecting a path for traffic in a Network theory, network or between or across multiple networks. Broadly, routing is performed in many types of networks, including circuit-switched networks, such as the public switched ...
is derived from early work by Birman and Joseph in this area. Active redundancy may introduce more complex failure modes into a system, such as continuous system reconfiguration due to faulty voting logic. Zero downtime system design means that modeling and simulation indicates mean time between failures significantly exceeds the period of time between
planned maintenance The technical meaning of maintenance involves functional checks, servicing, repairing or replacing of necessary devices, equipment, machinery, building infrastructure and supporting utilities in industrial, business, and residential installat ...
,
upgrade Upgrading is the process of replacing a product with a newer version of the same product. In computing and consumer electronics, an upgrade is generally a replacement of hardware, software or firmware with a newer or better version, in order to ...
events, or system lifetime. Zero downtime involves massive redundancy, which is needed for some types of aircraft and for most kinds of
communications satellite A communications satellite is an artificial satellite that relays and amplifies radio telecommunication signals via a Transponder (satellite communications), transponder; it creates a communication channel between a source transmitter and a Rad ...
s.
Global Positioning System The Global Positioning System (GPS) is a satellite-based hyperbolic navigation system owned by the United States Space Force and operated by Mission Delta 31. It is one of the global navigation satellite systems (GNSS) that provide ge ...
is an example of a zero downtime system. Fault
instrumentation Instrumentation is a collective term for measuring instruments, used for indicating, measuring, and recording physical quantities. It is also a field of study about the art and science about making measurement instruments, involving the related ...
can be used in systems with limited redundancy to achieve high availability. Maintenance actions occur during brief periods of downtime only after a fault indicator activates. Failure is only significant if this occurs during a
mission critical A mission critical (also mission essential) factor of a system is any factor (component, equipment, personnel, process, procedure, software, etc.) that is essential to business, organizational, or governmental operations. Failure or disruption o ...
period.
Modeling and simulation Modeling and simulation (M&S) is the use of models (e.g., physical, mathematical, behavioral, or logical representation of a system, entity, phenomenon, or process) as a basis for simulations to develop data utilized for managerial or technica ...
is used to evaluate the theoretical reliability for large systems. The outcome of this kind of model is used to evaluate different design options. A model of the entire system is created, and the model is stressed by removing components. Redundancy simulation involves the N-x criteria. N represents the total number of components in the system. x is the number of components used to stress the system. N-1 means the model is stressed by evaluating performance with all possible combinations where one component is faulted. N-2 means the model is stressed by evaluating performance with all possible combinations where two component are faulted simultaneously.


Reasons for unavailability

A survey among academic availability experts in 2010 ranked reasons for unavailability of enterprise IT systems. All reasons refer to not following best practice in each of the following areas (in order of importance): # Monitoring of the relevant components #
Requirements In engineering, a requirement is a condition that must be satisfied for the output of a work effort to be acceptable. It is an explicit, objective, clear and often quantitative description of a condition to be satisfied by a material, design, pro ...
and procurement # Operations # Avoidance of network failures # Avoidance of internal application failures # Avoidance of external services that fail # Physical environment # Network redundancy # Technical solution of backup # Process solution of backup # Physical location # Infrastructure redundancy # Storage architecture redundancy A book on the factors themselves was published in 2003.


Costs of unavailability

In a 1998 report from
IBM Global Services IBM Consulting, rebranded in 2021 from IBM Global Business Services, is the professional services and consulting arm of IBM. It provides services to companies, global government organizations, Nonprofit organization, non-profits and Non-government ...
, unavailable systems were estimated to have cost American businesses $4.54 billion in 1996, due to lost productivity and revenues.IBM Global Services, ''Improving systems availability'', IBM Global Services, 1998


See also

*
Availability In reliability engineering, the term availability has the following meanings: * The degree to which a system, subsystem or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at ...
*
Fault tolerance Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for high-availability, mission-critical, or even life-critical systems. Fault t ...
*
High-availability cluster In computing, high-availability clusters (HA clusters) or fail-over clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability sof ...
*
Overall equipment effectiveness Overall equipment effectiveness (OEE) is a measure of how well a manufacturing equipment is utilized compared to its full potential, during the periods when it is scheduled to run. It identifies the percentage of manufacturing time that is truly ...
*
Reliability, availability and serviceability Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The p ...
*
Responsiveness Responsiveness as a concept of computer science refers to the specific ability of a system or functional unit to complete assigned tasks within a given time. For example, it would refer to the ability of an artificial intelligence system to und ...
*
Scalability Scalability is the property of a system to handle a growing amount of work. One definition for software systems specifies that this may be done by adding resources to the system. In an economic context, a scalable business model implies that ...
*
Ubiquitous computing Ubiquitous computing (or "ubicomp") is a concept in software engineering, hardware engineering and computer science where computing is made to appear seamlessly anytime and everywhere. In contrast to desktop computing, ubiquitous computing imp ...


Notes


References


External links


Lecture Notes on Enterprise Computing
University of Tübingen

by Prof. Phil Koopman
Uptime Calculator (SLA)
{{DEFAULTSORT:High Availability System administration Quality control Applied probability Reliability engineering Measurement Computer networks engineering