In computing and telecommunications, downtime (also (system) outage or, colloquially, (system) drought) is a period when a system is unavailable. Unavailability is the proportion of a time span during which a system is unavailable or offline. This is usually a result of the system failing to function because of an unplanned event, or because of routine maintenance (a planned event).
The terms are commonly applied to networks and servers. The common reasons for unplanned outages are system failures (such as a crash) or communications failures (commonly known as network outage or, colloquially, network drought). For outages due to issues with general computer systems, the term computer outage (also IT outage or IT drought) can be used.
The term is also commonly applied in industrial environments in relation to failures in industrial production equipment. Some facilities measure the downtime incurred during a work shift, or during a 12- or 24-hour period. Another common practice is to identify each downtime event as having an operational, electrical or mechanical origin.
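The per-shift bookkeeping described above can be sketched as a simple tally. This is a minimal illustration, not a standard tool: the function name `downtime_by_origin`, the `(origin, minutes)` record shape and the sample figures are all assumptions made for the example.

```python
from collections import defaultdict

# The three origins named in the text; anything else is rejected.
ORIGINS = {"operational", "electrical", "mechanical"}

def downtime_by_origin(events):
    """Sum downtime minutes per origin for one shift.

    events: iterable of (origin, minutes) pairs (hypothetical record shape).
    """
    totals = defaultdict(int)
    for origin, minutes in events:
        if origin not in ORIGINS:
            raise ValueError(f"unknown origin: {origin}")
        totals[origin] += minutes
    return dict(totals)

# Made-up events for a single shift.
shift_events = [("mechanical", 25), ("electrical", 10), ("mechanical", 15)]
shift_totals = downtime_by_origin(shift_events)
```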
The opposite of downtime is uptime.
Types
Industry standards for the terms "Outage Duration" and "Maintenance Duration" can have different points of initiation and completion, so the following clarification should be used to avoid conflicts in contract execution:
# "Turnkey": this is the most encompassing of all outage types. The outage or maintenance starts when the operator of the plant or equipment presses the shutdown or stop button to halt operation. Unless otherwise noted, the outage or maintenance is considered complete when the plant or equipment is back in normal operation, ready to begin manufacturing, ready to be synchronized with the system or grid, or ready to perform its duties as a pump or compressor.
# "Breaker to Breaker": this outage or maintenance starts when the operator of the plant or equipment removes the power circuit from operation (main power breaker at "off", "disengaged" or "On-Cooldown"), but not the control circuit. This still allows the equipment to be cooled down or brought to ambient temperature so that outage or maintenance work can be prepared or initiated. Depending on the equipment type, a "Breaker to Breaker" outage can be advantageous when contracting out controls-related maintenance, as this type of work can be performed while the main equipment is still on cool-down or on stand-by. Unless otherwise noted, this type of outage is considered complete when the power circuit is re-energized by engaging the power breaker.
# "Completion of Lock-out/Tag-out": this outage or maintenance (sometimes mistaken for "Off-Cooldown", which is not the same) starts when the operator of the plant or equipment has removed the power circuit, disengaged the control circuit and neutralized other potential power and hazard sources (typically called Lock-Out, Tag-Out or "LOTO"). This point is typically the last phase of the outage initiation stage before actual work starts on the facility, plant or equipment. A safety briefing should always follow the LOTO activity, before any work is conducted. Unless otherwise noted, this type of outage is considered complete when the equipment has reached mechanical completion and is ready for a slow roll (for heavy rotating equipment), a bump test or rotation check (for motors), etc., but completion must follow the return of the work permit per LOTO procedures.
Any on-line testing, performance testing and tuning required should not count towards the outage duration, as these activities are typically conducted after the completion of the outage or maintenance event and are outside the control of most maintenance contractors.
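To see how much the three definitions can differ for the same event, the following sketch computes the outage duration from one timeline under each definition. The event names (`shutdown_pressed`, `breaker_opened` and so on), the mapping in `DEFINITIONS` and the timestamps are hypothetical and chosen only to mirror the start and end points described above.

```python
from datetime import datetime

# Hypothetical event names; each contract definition maps to a (start, end) pair.
DEFINITIONS = {
    "turnkey": ("shutdown_pressed", "normal_operation"),
    "breaker_to_breaker": ("breaker_opened", "breaker_closed"),
    "loto": ("loto_complete", "mechanical_completion"),
}

def outage_duration_hours(events, definition):
    """Return the outage length in hours under the given contract definition."""
    start_key, end_key = DEFINITIONS[definition]
    delta = events[end_key] - events[start_key]
    return delta.total_seconds() / 3600

# Made-up timeline for a single maintenance outage.
events = {
    "shutdown_pressed": datetime(2024, 3, 1, 6, 0),
    "breaker_opened": datetime(2024, 3, 1, 8, 0),
    "loto_complete": datetime(2024, 3, 1, 12, 0),
    "mechanical_completion": datetime(2024, 3, 4, 12, 0),
    "breaker_closed": datetime(2024, 3, 4, 18, 0),
    "normal_operation": datetime(2024, 3, 5, 6, 0),
}

durations = {name: outage_duration_hours(events, name) for name in DEFINITIONS}
```

With this timeline the same outage counts as 96, 82 or 72 hours depending on the definition, which is why a contract should name the definition explicitly.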
Characteristics
Unplanned downtime may be the result of an equipment malfunction, etc.
Telecommunication outage classifications
Downtime can be caused by failure in
hardware (physical equipment),
software (logic controlling equipment),
interconnecting equipment (such as cables, facilities, routers,...),
transmission (wireless, microwave, satellite), and/or
capacity (system limits).
The failures can occur because of
damage,
failure,
design,
procedural (improper use by humans),
engineering (how to use and deployment),
overload (traffic or system resources stressed beyond designed limits),
environment (support systems like power and HVAC),
scheduled (outages designed into the system for a purpose such as software upgrades and equipment growth),
other (none of the above but known), or
unknown.
The failures can be the responsibility of
customer/service provider,
vendor/supplier,
utility,
government,
contractor,
end customer,
public individual,
act of nature,
other (none of the above but known), or
unknown.
Impact
Outages caused by system failures can have a serious impact on the users of computer/network systems, in particular those industries that rely on a nearly 24-hour service:
* Medical informatics
* Nuclear power and other infrastructure
* Banks and other financial institutions
* Aeronautics, airlines
* News reporting
* E-commerce and online transaction processing
* Persistent online games
Also affected can be the users of an ISP and other customers of a telecommunication network.
Corporations can lose business due to a network outage, or they may default on a contract, resulting in financial losses. According to the Veeam 2019 cloud data management report, organizations encounter unplanned downtime, on average, 5-10 times per year, with the average cost of one hour of downtime being $102,450.
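As a rough worked example, the figures above can be turned into an annual estimate. The report figures quoted here do not give an outage length, so the one-hour duration per incident below is purely an assumption for illustration, as is the function name `annual_downtime_cost`.

```python
# Average hourly cost quoted above (USD).
COST_PER_HOUR_USD = 102_450

def annual_downtime_cost(incidents_per_year, hours_per_incident):
    """Estimated yearly cost; hours_per_incident is an assumed input."""
    return incidents_per_year * hours_per_incident * COST_PER_HOUR_USD

# Assuming each incident lasts one hour (hypothetical).
low = annual_downtime_cost(5, 1.0)    # 5 incidents per year
high = annual_downtime_cost(10, 1.0)  # 10 incidents per year
```

Under that assumption the estimate ranges from $512,250 to $1,024,500 per year; longer incidents scale the figure proportionally.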
Those people or organizations that are affected by downtime can be more sensitive to particular aspects:
* some are more affected by the length of an outage - it matters to them how much time it takes to recover from a problem
* others are sensitive to the timing of an outage - outages during peak hours affect them the most
The most demanding users are those that require high availability.
Famous outages
On Mother's Day, Sunday, May 8, 1988, a fire broke out in the main switching room of the Hinsdale Central Office of the Illinois Bell telephone company. One of the largest switching systems in the state, the facility processed more than 3.5 million calls each day while serving 38,000 customers, including numerous businesses, hospitals, and Chicago's O'Hare and Midway Airports.
Virtually the entire AT&T network of 4ESS toll tandem switches went in and out of service repeatedly on January 15, 1990, disrupting long-distance service for the entire United States. The problem dissipated by itself when traffic slowed down. A software bug was found to be the cause.
AT&T lost its Frame Relay network for 26 hours on April 13, 1998. This affected many thousands of customers, and bank transactions were one casualty. AT&T failed to meet the service level agreement on its contracts with customers and had to refund 6,600 customer accounts, costing millions of dollars.
Xbox Live had intermittent downtime during the 2007–2008 holiday season, which lasted thirteen days. Increased demand from Xbox 360 purchasers (the largest number of new user sign-ups in the history of Xbox Live) was given as the reason for the downtime; to make amends for the service issues, Microsoft offered its users the opportunity to receive a free game.
Sony's PlayStation Network outage began on April 20, 2011, and service was gradually restored from May 14, 2011, starting in the United States. This outage is the longest amount of time the PSN has been offline since its inception in 2006. Sony stated that the problem was caused by an external intrusion that compromised personal information. Sony reported on April 26, 2011, that a large amount of user data had been obtained in the same intrusion that caused the downtime.
Telstra's Ryde switch failed in late 2011 after water entered the electrical switchboard during continuing wet weather. The Ryde switch is one of the largest switches in Australia by area, and the failure affected more than 720,000 services.
The Miami datacenter of ServerAxis went offline unannounced on February 29, 2016, and was never restored. This impacted multiple providers and hundreds of websites. The outage impacted coverage of the 2016 NCAA Division I women's basketball tournament, as WBBState, one of the affected sites, was by far the most comprehensive provider of women's basketball statistics available.
The game platform Roblox had an outage around October 2021, during its Chipotle event. Many users attributed the outage to the event, which drew a large response because users could get a free Chipotle burrito during it. The outage was Roblox's longest downtime, lasting 3 days.
On July 8, 2022, Rogers suffered a major nationwide outage in Canada. This simultaneously affected cell phone and internet access, causing 911 calls and interbank transactions to fail and disrupting government services.
On July 19, 2024, CrowdStrike issued a faulty device driver update for its Falcon software, causing Windows PCs, servers, and virtual machines to crash and enter boot loops. The incident affected approximately 8.5 million Windows machines worldwide, including critical infrastructure such as 911 services in various states. It is considered to be the largest outage in the history of information technology.
Service levels
In service level agreements, it is common to state a percentage value (per month or per year) that is calculated by dividing the sum of all downtime periods by the total length of a reference time span (e.g. a month). 0% downtime means that the server was available all the time.
For Internet servers, downtime above 1% per year can be regarded as unacceptable, as this means a downtime of more than 3 days per year. For e-commerce and other industrial use, any value above 0.1% (more than about 8.8 hours per year) is usually considered unacceptable.
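The calculation can be sketched in a few lines. This is an illustrative example only: the function names `downtime_percent` and `allowed_downtime`, the 365-day year and the sample outages are assumptions, not part of any particular service level agreement.

```python
from datetime import timedelta

def downtime_percent(outages, period):
    """Sum of all downtime spans divided by the reference period, in percent."""
    total_down = sum(outages, timedelta())
    return 100 * total_down / period

def allowed_downtime(percent, period):
    """Downtime budget that corresponds to a percentage of the period."""
    return period * (percent / 100)

year = timedelta(days=365)   # assumed non-leap year
month = timedelta(days=30)   # assumed 30-day reference month

# Two made-up outages in one month: 2 hours and 90 minutes.
monthly = downtime_percent([timedelta(hours=2), timedelta(minutes=90)], month)

budget_1_percent = allowed_downtime(1, year)      # about 3.65 days
budget_01_percent = allowed_downtime(0.1, year)   # about 8.76 hours
```

Here `monthly` comes to roughly 0.49%, which would pass a 1% threshold but fail a 0.1% one.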
Response and reduction of impact
It is the duty of the network designer to prevent network outages. When one does happen, a well-designed system reduces its effects by keeping outages localized, so that they can be detected and fixed as soon as possible.
A process needs to be in place to detect a malfunction (network monitoring) and to restore the network to a working condition. Restoration generally involves a help desk team composed of trained engineers who can troubleshoot a problem; a separate help desk team is usually necessary to field user input, which can be particularly demanding during downtime.
A network management system can be used to detect faulty or degrading components before customers complain, allowing proactive fault rectification.
Risk management techniques can be used to determine the impact of network outages on an organisation and what actions may be required to minimise risk. Risk may be minimised by using reliable components, by performing maintenance such as upgrades, by using redundant systems, or by having a contingency plan or business continuity plan.
Technical means can reduce errors with error correcting codes, retransmission, checksums, or diversity schemes.
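Two of these techniques, checksums and retransmission, combine naturally: the receiver verifies a checksum and the sender repeats the transfer when verification fails. The sketch below uses CRC-32 from Python's standard `zlib` module; the framing layout, the helper names and the simulated channel that corrupts the first few transfers are assumptions made for the example, not a real protocol.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def verify(data: bytes):
    """Return the payload if its checksum matches, otherwise None."""
    payload, received = data[:-4], data[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == received:
        return payload
    return None

def send_with_retransmission(payload, channel, max_attempts=3):
    """Resend the frame until the checksum verifies or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        received = verify(channel(frame(payload)))
        if received is not None:
            return received, attempt
    raise ConnectionError("checksum failed on every attempt")

def make_flaky_channel(corrupt_first_n):
    """Simulated channel (hypothetical) that flips one bit in the first n transfers."""
    state = {"calls": 0}
    def channel(data):
        state["calls"] += 1
        if state["calls"] <= corrupt_first_n:
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data
    return channel

result = send_with_retransmission(b"hello", make_flaky_channel(2))
```

In this run the first two transfers are rejected by the checksum and the third succeeds, so the corruption never reaches the application.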
One of the biggest causes of downtime is misconfiguration, where a planned change goes wrong. Typically organisations rely on manual effort to manage the process of configuration backups, but this requires highly skilled engineers with the time to manage the process across a multi-vendor network. Automation tools are available to manage backups, but there are very few solutions that handle configuration recovery which is needed to minimize the overall impact of the outage.
Planning
A planned outage is the result of a planned activity by the system owner and/or by a service provider. These outages, often scheduled during the maintenance window, can be used to perform tasks including the following:
* Deferred maintenance, e.g., a deferred hardware repair or a deferred restart to clean up a garbled memory
* Diagnostics to isolate a detected fault
* Hardware fault repair
* Fixing an error or omission in a configuration database or in a recent configuration database change
* Fixing an error in an application database or in a recent application database change
* Software patching/software updates to fix a software fault.
Outages can also be planned as a result of a predictable natural event, such as a Sun outage.
Maintenance downtimes have to be carefully scheduled in industries that rely on computer systems. In many cases, system-wide downtimes can be averted using what is called a "rolling upgrade" - the process of incrementally taking down parts of the system for upgrade, without affecting the overall functionality.
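A rolling upgrade can be sketched as a loop that takes one node out of service at a time and refuses to proceed if too few nodes would remain. The function name `rolling_upgrade`, the `min_in_service` capacity rule and the three-node cluster are assumptions for this example; real systems would also drain traffic and health-check each node before returning it to service.

```python
def rolling_upgrade(nodes, upgrade, min_in_service):
    """Upgrade nodes one at a time, keeping at least min_in_service nodes up.

    nodes: dict mapping node name to version (updated in place).
    upgrade: callable taking the old version and returning the new one.
    Returns the lowest number of nodes that were in service at any point.
    """
    in_service = set(nodes)
    lowest_seen = len(in_service)
    for name in list(nodes):
        if len(in_service) - 1 < min_in_service:
            raise RuntimeError("not enough capacity to take a node down")
        in_service.discard(name)              # take the node out of service
        lowest_seen = min(lowest_seen, len(in_service))
        nodes[name] = upgrade(nodes[name])    # upgrade while it is offline
        in_service.add(name)                  # return it to service
    return lowest_seen

# Hypothetical three-node cluster upgraded from version 1.0 to 2.0.
cluster = {"node-a": "1.0", "node-b": "1.0", "node-c": "1.0"}
lowest = rolling_upgrade(cluster, lambda version: "2.0", min_in_service=2)
```

At no point do fewer than two of the three nodes serve traffic, so the system as a whole stays available throughout the upgrade.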
Avoidance
For most websites, website monitoring is available. Website monitoring (synthetic or passive) is a service that tracks a site's downtime and its users' activity.
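A synthetic check can be as simple as requesting a page at regular intervals and recording whether it answered. The sketch below is a minimal illustration under stated assumptions: `http_probe` treats any HTTP status from 200 to 399 as "up", and `downtime_intervals` expects samples as `(timestamp, ok)` pairs; both names and the sample data are hypothetical, and the scheduling of the checks is left out.

```python
import urllib.request

def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx or 3xx status, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError and timeouts are all OSError subclasses.
        return False

def downtime_intervals(samples):
    """Turn (timestamp, ok) samples into a list of (down_from, up_again) spans.

    A span still open at the last sample is reported with None as its end.
    """
    intervals = []
    down_since = None
    for timestamp, ok in samples:
        if not ok and down_since is None:
            down_since = timestamp
        elif ok and down_since is not None:
            intervals.append((down_since, timestamp))
            down_since = None
    if down_since is not None:
        intervals.append((down_since, None))
    return intervals

# Made-up results of checks taken every 60 seconds.
samples = [(0, True), (60, False), (120, False), (180, True), (240, False)]
spans = downtime_intervals(samples)
```

The resolution of such a monitor is limited by the check interval: an outage shorter than the gap between two checks can go unnoticed.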
Other usage
Downtime can also refer to time when human capital or other assets go down. For instance, if employees are in meetings or unable to perform their work due to another constraint, they are down. This can be equally expensive, and can be the result of another asset (e.g. computer systems) being down. This is also commonly known as "dead time".
Downtime is also generalized in a personal sense, being used to refer to a period of sleep or recreation.
This term is also used in factories and other industrial settings. See total productive maintenance (TPM).
Measuring downtime
There are many external services which can be used to monitor the uptime and downtime as well as availability of a service or a host.
A notable example is Downdetector, a website owned by Ookla that tracks downtime and major outages using outage reports submitted by users on the site, on each service's Downdetector page, and on Twitter.
It is currently available in 45 countries (with a different site in each country), and tracks 12,000 services internationally.
See also
* High availability
* Uptime
* Mean down time
* Planned downtime
* Carrier grade