A single point of failure (SPOF) is a part of a system that, if it fails, stops the entire system from working. The term implies that there is no backup or redundant option that would allow the system to continue functioning without it. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system. If a SPOF is present in a system, it creates a potential interruption that is substantially more disruptive than an error elsewhere in the system would be.

Overview

Systems can be made robust by adding redundancy at all potential SPOFs. Redundancy can be achieved at various levels. Assessing a potential SPOF involves identifying the critical components of a complex system that would provoke a total system failure in case of malfunction. Highly reliable systems should not rely on any such individual component.

For instance, the owner of a small tree care company may own only one woodchipper. If the chipper breaks, they may be unable to complete their current job and may have to cancel future jobs until they can obtain a replacement. The owner could prepare for this in multiple ways. At the lowest level, they may keep spare parts ready to repair the woodchipper in case it fails. At a higher level, they may have a second woodchipper that they can bring to the job site. Finally, at the highest level, they may have enough equipment available to completely replace everything at the work site in the case of multiple failures.

Figure: Possible SPOFs in a simple setup.
Figure: Using redundancy to avoid some SPOFs.
Figure: Completely redundant system without SPOFs (note: assumes generator and grid sources are each rated at N, each UPS is rated at N, and "A/C" and "Electrical" are in and of themselves completely fault-tolerant systems).
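The benefit of a second woodchipper can be illustrated with a rough calculation. The sketch below assumes that components fail independently and that each has a known availability (the fraction of time it is operational). The function names and the 0.95 figure are hypothetical, chosen only for illustration.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must work."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability of a redundant group in which any one component suffices."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figure: a chipper that is operational 95% of the time.
one_chipper = series_availability([0.95])
two_chippers = redundant_availability([0.95, 0.95])

print(f"one chipper:  {one_chipper:.4f}")
print(f"two chippers: {two_chippers:.4f}")
```

Under these assumptions, the second chipper raises availability from 0.95 to about 0.9975. Chaining two such components in series, where both must work, would lower it to about 0.9025. Real failures are often correlated (for example, both chippers sharing one tow vehicle), so the independent-failure figure should be read as an upper bound.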

Computing

A fault-tolerant computer system can be achieved at the internal component level, at the system level (multiple machines), or at the site level (replication). One would normally deploy a load balancer to ensure high availability for a server cluster at the system level. In a high-availability server cluster, each individual server may attain internal component redundancy by having multiple power supplies, hard drives, and other components. System-level redundancy could be obtained by having spare servers waiting to take on the work of another server if it fails.

Since a data center is often a support center for other operations such as business logic, it represents a potential SPOF in itself. Thus, at the site level, the entire cluster may be replicated at another location, where it can be accessed in case the primary location becomes unavailable. This is typically addressed as part of an IT disaster recovery program. While the solution to this SPOF was previously physical duplication of clusters, the high demand for this duplication led multiple businesses to outsource it to third parties using cloud computing. Scholars have argued, however, that doing so simply moves the SPOF and may even increase the likelihood of a failure or cyberattack. (Lever, Kirsty E., Madjid Merabti, and Kashif Kifayat. "Single Points of Failure Within Systems-of-Systems." 14th Annual Post Graduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting (PGNet). Vol. 183. 2013.)
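System-level redundancy behind a load balancer can be sketched as follows. This is a toy illustration, not a description of any particular product: the class `FailoverPool`, its methods, and the server names are all hypothetical, and health status is set by hand instead of by real health checks.

```python
class NoHealthyServerError(RuntimeError):
    """Raised when every server in the pool is marked as failed."""


class FailoverPool:
    """Toy round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark_down(self, server):
        """Record that a server has failed."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Record that a known server has recovered."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server in round-robin order."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise NoHealthyServerError("all servers are down")


pool = FailoverPool(["app-1", "app-2", "app-3"])  # hypothetical server names
pool.mark_down("app-2")
picks = [pool.pick() for _ in range(4)]
print(picks)  # ['app-1', 'app-3', 'app-1', 'app-3']
```

Requests keep flowing while at least one server is healthy. Note that the pool object in this sketch is itself a single component, so a real deployment would also need to make the balancer redundant.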

Paul Baran and Donald Davies developed packet switching, a key part of "survivable communications networks". Such networks, including ARPANET and the Internet, are designed to have no single point of failure. Multiple paths between any two points on the network allow those points to continue communicating with each other, with packets "routing around" damage, even after the failure of any one path or any one intermediate node.
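Whether a given topology has a single point of failure among its nodes can be checked mechanically. The sketch below uses a simple brute-force test (remove each node in turn and see whether the rest of the network is still connected) in place of a faster articulation-point algorithm. The function names and the two example topologies are hypothetical.

```python
from collections import deque


def is_connected(graph, removed=None):
    """Check whether all nodes except `removed` can still reach each other."""
    nodes = [n for n in graph if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)


def single_points_of_failure(graph):
    """Return nodes whose removal disconnects the remaining network."""
    return sorted(n for n in graph if not is_connected(graph, removed=n))


# Hypothetical topologies: a ring offers two paths between any pair of nodes,
# while a star routes everything through the hub.
ring = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
star = {"hub": ["A", "B", "C"], "A": ["hub"], "B": ["hub"], "C": ["hub"]}

print(single_points_of_failure(ring))  # []
print(single_points_of_failure(star))  # ['hub']
```

The ring survives the loss of any one node, whereas the star depends entirely on its hub. The same check could be extended to links by removing one edge at a time.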

Software engineering

In software engineering, a bottleneck occurs when the capacity of an application or a computer system is limited by a single component. The bottleneck has the lowest throughput of all parts of the transaction path. A common example is when the programming language in use is capable of parallel processing, but a given snippet of code runs several independent processes sequentially rather than simultaneously.
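The sequential-versus-simultaneous example can be sketched in Python. The code below assumes the independent tasks are I/O-bound, which is why a thread pool is used; the `fetch` function and its delay are hypothetical stand-ins for real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(item, delay=0.05):
    """Stand-in for an independent, I/O-bound task (hypothetical)."""
    time.sleep(delay)
    return item * 2


def run_sequentially(items):
    """Run the tasks one after another."""
    return [fetch(i) for i in items]


def run_concurrently(items):
    """Run the tasks in a thread pool; map() keeps results in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(items))) as executor:
        return list(executor.map(fetch, items))


items = [1, 2, 3, 4]

start = time.perf_counter()
sequential_results = run_sequentially(items)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
concurrent_results = run_concurrently(items)
concurrent_time = time.perf_counter() - start

print(sequential_results == concurrent_results)
print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

Both versions return the same results, but the sequential one waits for each task in turn. For CPU-bound work in CPython, a process pool (`ProcessPoolExecutor`) is usually the more appropriate choice.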

Performance engineering

Tracking down bottlenecks is called performance analysis. Bottlenecks are sometimes known as "hot spots": the sections of code that execute most frequently, that is, those with the highest execution count. Reducing them is usually achieved with the help of specialized tools, known as performance analyzers or profilers. The objective is to make those particular sections of code perform as fast as possible in order to improve overall algorithmic efficiency.
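As one illustration of such a tool, the sketch below uses Python's standard-library `cProfile` and `pstats` modules to locate a hot spot. The functions `slow_sum` and `workload` are hypothetical examples written to be deliberately wasteful.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately wasteful hot spot (hypothetical example function)."""
    total = 0
    for i in range(n):
        total += sum(range(i % 100))
    return total


def workload():
    """Entry point whose time is dominated by slow_sum."""
    return slow_sum(20_000)


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists call counts and time per function, so `slow_sum` and the built-in `sum` it calls repeatedly stand out as the place to optimize first.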

Computer security

A vulnerability or security exploit in just one component can compromise an entire system. One of the largest concerns in computer security is eliminating SPOFs without sacrificing too much convenience for the user. With the invention and popularization of the Internet, many systems became connected to the broader world through numerous connections that are difficult to secure. While companies have developed a number of solutions to this, the most consistent form of SPOF in complex systems tends to remain user error, either through accidental mishandling by an operator or through outside interference such as phishing attacks.

Other fields

The concept of a single point of failure has also been applied to fields outside of engineering, computers, and networking, such as corporate supply chain management and transportation management ("Crucial, Century-Old, And Sometimes Stuck: Connecticut Bridge Is Key To Northeast Corridor". Connecticut Public Radio, August 8, 2017). Design structures that create single points of failure include bottlenecks and series circuits (in contrast to parallel circuits).

In transportation, noted recent examples include the Nipigon River Bridge in Canada, where a partial bridge failure in January 2016 entirely severed road traffic between Eastern Canada and Western Canada for several days, because the bridge is located along a portion of the Trans-Canada Highway where there is no alternate detour route for vehicles to take; and the Norwalk River Railroad Bridge in Norwalk, Connecticut, an aging swing bridge that sometimes gets stuck when opening or closing, disrupting rail traffic on the Northeast Corridor line.

The concept of a single point of failure has also been applied to the field of intelligence. Edward Snowden talked of the dangers of being what he described as "the single point of failure", the sole repository of information.

Life-support systems

A component of a life-support system that would constitute a single point of failure would be required to be extremely reliable.
