
A single point of failure (SPOF) is a part of a system that, if it fails, stops the entire system from working. The term implies that there is no backup or redundant option that would enable the system to continue functioning without it. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system. A SPOF creates the potential for an interruption that is substantially more disruptive than an error elsewhere in the system would be.
Overview
Systems can be made robust by adding redundancy in all potential SPOFs. Redundancy can be achieved at various levels.
The assessment of a potential SPOF involves identifying the critical components of a complex system that would provoke a total system failure in case of malfunction. Highly reliable systems should not rely on any such individual component.
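The assessment described above can be sketched as a brute-force search: remove each component in turn and check whether the system's entry point can still reach its goal. This is only an illustrative sketch. The topology, the component names, and the functions `reachable` and `find_spofs` are hypothetical, and the model assumes that a component either works fully or fails completely.

```python
from collections import deque


def reachable(graph, start, goal, removed):
    """Breadth-first search from start to goal that skips the removed node."""
    if start == removed or goal == removed:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def find_spofs(graph, start, goal):
    """Return the intermediate components whose failure alone cuts start off from goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates if not reachable(graph, start, goal, n))


# Hypothetical setup: two web servers, but only one switch and one database.
topology = {
    "client": ["switch"],
    "switch": ["web1", "web2"],
    "web1": ["db"],
    "web2": ["db"],
    "db": ["storage"],
}

spofs = find_spofs(topology, "client", "storage")
print(spofs)
```

In this hypothetical setup the duplicated web servers are not reported, while the single switch and the single database are. Larger systems would normally use a linear-time articulation-point algorithm rather than this one-removal-at-a-time check.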
For instance, the owner of a small tree care company may only own one woodchipper. If the chipper breaks, they may be unable to complete their current job and may have to cancel future jobs until they can obtain a replacement. The owner could prepare for this in multiple ways. They may keep spare parts ready for the repair of the woodchipper in case it fails. At a higher level, they may have a second woodchipper that they can bring to the job site. Finally, at the highest level, they may have enough equipment available to completely replace everything at the work site in the case of multiple failures.
Figure: Possible SPOFs in a simple setup
Figure: Using redundancy to avoid some SPOFs
Figure: Completely redundant system without SPOFs (note: assumes generator and grid sources are each rated at N, each UPS is rated at N, and "A/C" and "Electrical" are in and of themselves completely fault-tolerant systems)
Computing
Fault tolerance in a computer system can be achieved at the internal component level, at the system level (multiple machines), or at the site level (replication).
One would normally deploy a load balancer to ensure high availability for a server cluster at the system level. In a high-availability server cluster, each individual server may attain internal component redundancy by having multiple power supplies, hard drives, and other components. System-level redundancy could be obtained by having spare servers waiting to take on the work of another server if it fails.
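System-level redundancy of the kind described above can be illustrated with a minimal round-robin dispatcher that skips servers that are down. This is a sketch under stated assumptions, not a production design: the `Server` and `RoundRobinBalancer` classes, the server names, and the use of `ConnectionError` to signal a failed server are all hypothetical.

```python
import itertools


class Server:
    """Hypothetical server that either handles a request or is down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class RoundRobinBalancer:
    """Rotates through the pool and fails over to the next server on error."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._cycle = itertools.cycle(self.servers)

    def dispatch(self, request):
        # Try each server at most once per request, continuing the rotation.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            try:
                return server.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("all servers are down")


pool = [Server("web1"), Server("web2"), Server("spare")]
balancer = RoundRobinBalancer(pool)

first = balancer.dispatch("GET /")   # served by web1
pool[1].healthy = False              # simulate a failure of web2
second = balancer.dispatch("GET /")  # web2 is skipped, the spare takes over
print(first)
print(second)
```

In this sketch the balancer is itself a single component, so a real deployment would typically make the balancer redundant as well.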
Since a data center is often a support center for other operations such as business logic, it represents a potential SPOF in itself. Thus, at the site level, the entire cluster may be replicated at another location, where it can be accessed in case the primary location becomes unavailable. This is typically addressed as part of an IT disaster recovery program. While previously the solution to this SPOF was physical duplication of clusters, the high demand for this duplication led multiple businesses to outsource duplication to third parties using cloud computing. Scholars have argued, however, that doing so simply moves the SPOF and may even increase the likelihood of a failure or cyberattack.
[Lever, Kirsty E., Madjid Merabti, and Kashif Kifayat. "Single Points of Failure Within Systems-of-Systems." ''14th Annual Post Graduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting (PGNet)''. Vol. 183. 2013.]
Paul Baran and Donald Davies developed packet switching, a key part of "survivable communications networks". Such networks, including ARPANET and the Internet, are designed to have no single point of failure. Multiple paths between any two points on the network allow those points to continue communicating with each other, the packets "routing around" damage, even after any single failure of any one particular path or any one intermediate node.
Software engineering
In software engineering, a bottleneck occurs when the capacity of an application or a computer system is limited by a single component. The bottleneck has the lowest throughput of all parts of the transaction path. A common example is when the programming language in use is capable of parallel processing, but a given snippet of code runs several independent processes sequentially rather than simultaneously.
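The example in the preceding paragraph can be made concrete with a small sketch that runs the same independent tasks first one after another and then concurrently. The task function, its simulated delay, and the worker count are hypothetical. Threads are used here on the assumption that the work is I/O-bound; CPU-bound work in Python would more likely call for separate processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def independent_task(task_id, delay=0.05):
    """Stand-in for an independent, I/O-bound unit of work (hypothetical)."""
    time.sleep(delay)
    return task_id * 2


def run_sequentially(task_ids):
    # Each task waits for the previous one, so total time grows with the task count.
    return [independent_task(t) for t in task_ids]


def run_concurrently(task_ids, workers=4):
    # The tasks do not depend on each other, so they can overlap in time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(independent_task, task_ids))  # map keeps input order


tasks = [1, 2, 3, 4]

start = time.perf_counter()
sequential_results = run_sequentially(tasks)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
concurrent_results = run_concurrently(tasks)
concurrent_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

Both versions return the same results. Only the elapsed time differs, and the size of that difference depends on the machine and the workload.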
Performance engineering
Tracking down bottlenecks (sometimes known as ''hot spots'', the sections of code that execute most frequently, i.e., have the highest execution count) is called performance analysis. Bottlenecks are usually reduced with the help of specialized tools known as performance analyzers or profilers. The objective is to make those particular sections of code perform as fast as possible to improve overall algorithmic efficiency.
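As one illustration of such a tool, the sketch below uses Python's standard `cProfile` and `pstats` modules to rank functions by call count. The workload functions `hot_loop`, `rarely_called`, and `workload` are hypothetical stand-ins for application code, and sorting by call count is just one of several possible orderings.

```python
import cProfile
import io
import pstats


def hot_loop(n):
    """Hypothetical hot spot: called many times by the workload."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def rarely_called():
    """Hypothetical cold path: called once."""
    return sum(range(10))


def workload():
    for _ in range(200):
        hot_loop(1_000)
    rarely_called()


def profile_report(func, limit=5):
    """Profile func and return a text report sorted by number of calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("ncalls").print_stats(limit)
    return buffer.getvalue()


report = profile_report(workload)
print(report)
```

The function with the highest execution count appears at the top of the report, which makes it the first candidate for optimization.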
Computer security
A vulnerability or security exploit in just one component can compromise an entire system. One of the largest concerns in computer security is attempting to eliminate SPOFs without sacrificing too much convenience to the user. With the invention and popularization of the Internet, several systems became connected to the broader world through many difficult-to-secure connections.
While companies have developed a number of solutions to this, the most consistent form of SPOFs in complex systems tends to remain user error, either by accidental mishandling by an operator or outside interference through phishing attacks.
Other fields
The concept of a single point of failure has also been applied to fields outside of engineering, computers, and networking, such as corporate supply chain management and transportation management.
["Crucial, Century-Old, And Sometimes Stuck: Connecticut Bridge Is Key To Northeast Corridor." Connecticut Public Radio, August 8, 2017.]
Design structures that create single points of failure include bottlenecks and series circuits (in contrast to parallel circuits).
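The contrast between series and parallel arrangements can be illustrated with the standard reliability formulas, under the assumption that components fail independently of one another. In a series arrangement every component must work, so the reliabilities multiply. In a parallel arrangement the system fails only if every component fails. The function names and the example values below are hypothetical.

```python
import math


def series_reliability(reliabilities):
    """All components must work: R = R1 * R2 * ... * Rn."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities):
    """At least one component must work: R = 1 - (1 - R1) * ... * (1 - Rn)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Two hypothetical components, each working 90% of the time.
components = [0.9, 0.9]
in_series = series_reliability(components)
in_parallel = parallel_reliability(components)
print(f"series: {in_series:.2f}, parallel: {in_parallel:.2f}")
```

Under these assumptions, chaining the two components lowers reliability below that of either one, while duplicating them raises it above that of either one.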
In transportation, some noted recent examples of the concept's application have included the Nipigon River Bridge in Canada, where a partial bridge failure in January 2016 entirely severed road traffic between Eastern Canada and Western Canada for several days because it is located along a portion of the Trans-Canada Highway where there is no alternate detour route for vehicles to take; and the Norwalk River Railroad Bridge in Norwalk, Connecticut, an aging swing bridge that sometimes gets stuck when opening or closing, disrupting rail traffic on the Northeast Corridor line.
The concept of a single point of failure has also been applied to the field of intelligence. Edward Snowden talked of the dangers of being what he described as "the single point of failure", the sole repository of information.
Life-support systems
A component of a life-support system that would constitute a single point of failure would be required to be extremely reliable.