A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.

Overview

Systems can be made robust by adding redundancy in all potential SPOFs. Redundancy can be achieved at various levels. The assessment of a potential SPOF involves identifying the critical components of a complex system that would provoke a total system failure in case of malfunction. Highly reliable systems should not rely on any such individual component.

For instance, the owner of a small tree care company may only own one woodchipper. If the chipper breaks, he may be unable to complete his current job and may have to cancel future jobs until he can obtain a replacement. The owner of the tree care company may have spare parts ready for the repair of the woodchipper, in case it fails. At a higher level, he may have a second woodchipper that he can bring to the job site. Finally, at the highest level, he may have enough equipment available to completely replace everything at the work site in the case of multiple failures.

File:Spof_simple.svg: Possible SPOFs in a simple setup.

File:Spof_redundancy.svg: Using redundancy to avoid some SPOFs.

File:Spof_complex.svg: Completely redundant system without SPOFs. (Note: Assumes Generator and Grid sources are each rated at N, each UPS is rated at N, and "A/C" and "Electrical" are in themselves completely fault-tolerant systems.)
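The effect of adding redundancy at a potential SPOF can be estimated with a short calculation. The sketch below is illustrative only: it assumes component failures are independent, and the function names and the 0.95 availability figure are hypothetical, not taken from any measured system.

```python
from math import prod


def series_availability(availabilities):
    """Every component is required, so the availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """One working component is enough: 1 minus the chance that all fail."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figure: a single woodchipper that is usable 95% of the time.
one_chipper = series_availability([0.95])
# A second, independent chipper kept as a spare.
two_chippers = parallel_availability([0.95, 0.95])

print(f"one chipper:  {one_chipper:.4f}")
print(f"two chippers: {two_chippers:.4f}")
```

Under the independence assumption, the second chipper cuts the expected unavailability from 5% to 0.25%. Failures with a shared cause (for example, both chippers depending on the same truck) would make the real figure worse.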

Computing

A fault-tolerant computer system can be achieved at the internal component level, at the system level (multiple machines), or at the site level (replication). One would normally deploy a load balancer to ensure high availability for a server cluster at the system level. In a high-availability server cluster, each individual server may attain internal component redundancy by having multiple power supplies, hard drives, and other components. System-level redundancy could be obtained by having spare servers waiting to take on the work of another server if it fails. Since a data center is often a support center for other operations such as business logic, it represents a potential SPOF in itself. Thus, at the site level, the entire cluster may be replicated at another location, where it can be accessed in case the primary location becomes unavailable. This is typically addressed as part of an IT disaster recovery (resiliency) program.
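System-level redundancy with spare servers can be sketched as a simple failover loop. This is a minimal illustration rather than a production load balancer: the `call_with_failover` helper, the `AllReplicasFailed` exception, and the two stand-in servers are all hypothetical names introduced here, and a real deployment would add health checks and timeouts.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no redundant server could handle the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # This replica is unavailable; remember why and try the next one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed")


def primary(request: str) -> str:
    # Stand-in for a server that is currently down.
    raise ConnectionError("primary is down")


def standby(request: str) -> str:
    # Stand-in for a spare server waiting to take over.
    return f"standby handled {request}"


print(call_with_failover([primary, standby], "GET /status"))
```

Note that the component running this loop is itself a potential SPOF unless it is also replicated, which is why redundancy is usually considered at every level.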

Paul Baran and Donald Davies developed packet switching, a key part of "survivable communications networks". Such networks, including ARPANET and the Internet, are designed to have no single point of failure. Multiple paths between any two points on the network allow those points to continue communicating with each other, the packets "routing around" damage, even after any single failure of any one particular path or any one intermediate node.
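Whether a network topology contains a node that is a single point of failure can be checked by removing each node in turn and testing whether the remaining nodes can still reach each other. The sketch below uses a brute-force breadth-first search on small, made-up undirected topologies (`star` and `ring` are hypothetical examples); it is meant to illustrate the idea, not to model any real network.

```python
from collections import deque


def is_connected(graph, removed=None):
    """Return True if all nodes except `removed` can still reach each other."""
    nodes = [n for n in graph if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


def single_points_of_failure(graph):
    """List the nodes whose failure would split the remaining network."""
    return sorted(n for n in graph if not is_connected(graph, removed=n))


# Hypothetical topologies, given as undirected adjacency lists.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}

print(single_points_of_failure(star))  # ['hub']
print(single_points_of_failure(ring))  # []
```

In the star topology every path runs through the hub, so the hub is a SPOF. In the ring, each pair of nodes has two paths, so traffic can route around any single failed node.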

Software engineering

In software engineering, a bottleneck occurs when the capacity of an application or a computer system is limited by a single component. The bottleneck has the lowest throughput of all parts of the transaction path.
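Because the bottleneck is the part of the transaction path with the lowest throughput, it also caps the throughput of the whole path. A minimal sketch, assuming a hypothetical three-stage pipeline with made-up throughput figures in requests per second:

```python
def find_bottleneck(stage_throughput):
    """Return the (name, throughput) of the slowest stage in the path."""
    name = min(stage_throughput, key=stage_throughput.get)
    return name, stage_throughput[name]


# Hypothetical measurements, in requests per second.
pipeline = {"web": 1200.0, "app": 800.0, "database": 150.0}

stage, limit = find_bottleneck(pipeline)
print(f"bottleneck: {stage}, end-to-end throughput is capped at {limit} req/s")
```

Speeding up any stage other than the one returned here would not raise the end-to-end throughput in this simple model.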

Performance engineering

Tracking down bottlenecks (sometimes known as "hot spots": sections of the code that execute most frequently, i.e., have the highest execution count) is called performance analysis. Reducing them is usually achieved with the help of specialized tools, known as performance analyzers or profilers. The objective is to make those particular sections of code perform as fast as possible to improve overall algorithmic efficiency.
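As one example of such a tool, Python ships with the `cProfile` profiler and the `pstats` reporting module in its standard library. The sketch below profiles a deliberately slow, hypothetical workload (`slow_sum_of_squares` and `workload` are made up for illustration) and returns a report sorted by cumulative time, so the hot spot appears near the top.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop that acts as the hot spot."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    """Hypothetical workload that calls the hot spot repeatedly."""
    return [slow_sum_of_squares(20_000) for _ in range(20)]


def profile_report(func, top=5):
    """Run `func` under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


report = profile_report(workload)
print(report)
```

The call counts and timings in the report depend on the machine and the Python version, so no specific numbers are shown here.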

Computer security

A vulnerability or security exploit in just one component can compromise an entire system.

Other fields

The concept of a single point of failure has also been applied to fields outside of engineering, computers, and networking, such as corporate supply chain management and transportation management ("Crucial, Century-Old, And Sometimes Stuck: Connecticut Bridge Is Key To Northeast Corridor", Connecticut Public Radio, August 8, 2017). Design structures that create single points of failure include bottlenecks and series circuits (in contrast to parallel circuits).

In transportation, noted recent examples of the concept's application have included the Nipigon River Bridge in Canada, where a partial bridge failure in January 2016 entirely severed road traffic between Eastern Canada and Western Canada for several days because it is located along a portion of the Trans-Canada Highway where there is no alternate detour route for vehicles to take; and the Norwalk River Railroad Bridge in Norwalk, Connecticut, an aging swing bridge that sometimes gets stuck when opening or closing, disrupting rail traffic on the Northeast Corridor line.

The concept of a single point of failure has also been applied to the field of intelligence. Edward Snowden talked of the dangers of being what he described as "the single point of failure", the sole repository of information.

Life-support systems

A component of a life-support system that would constitute a single point of failure would be required to be extremely reliable.
