Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.
Concept
In software development, a software system's ability to tolerate failures while still ensuring adequate quality of service (often generalized as ''resiliency'') is typically specified as a requirement. However, development teams often fail to meet this requirement because of factors such as short deadlines or lack of knowledge of the field. Chaos engineering is a technique for meeting the resilience requirement.
Chaos engineering can be used to achieve resilience against infrastructure failures, network failures, and application failures.
Operational readiness using chaos engineering
Calculating how much confidence we can have in the interconnected complex systems that are put into production environments requires operational readiness metrics. Operational readiness can be evaluated using chaos engineering simulations supported by Kubernetes infrastructure in big data. Solutions for the operational readiness of a platform strengthen its backup, restore, network file transfer, and failover capabilities, as well as its overall security. Gautam Siwach et al. evaluated the effect of inducing chaos in a Kubernetes environment: random pods receiving data from edge devices in data centers were terminated while analytics were being processed on a big data network, and the recovery time of the pods was inferred in order to calculate an estimated response time.
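The kind of experiment described above can be illustrated with a small simulation. The following is a minimal sketch, not the setup used in the cited evaluation: the `Pod` class, the pod names, the nominal restart times and the 20% jitter are all hypothetical, and no real Kubernetes API is called. It terminates one random pod per round and summarizes the recovery times that result.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class Pod:
    """Hypothetical stand-in for a Kubernetes pod; not the real API object."""
    name: str
    restart_seconds: float  # assumed nominal time needed to replace the pod


def run_experiment(pods, rounds, seed=0):
    """Terminate one random pod per round and record its recovery time."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    recovery_times = []
    for _ in range(rounds):
        victim = rng.choice(pods)
        # Assumption: recovery time varies by up to 20% around the nominal value.
        jitter = rng.uniform(0.8, 1.2)
        recovery_times.append(victim.restart_seconds * jitter)
    return recovery_times


def summarize(recovery_times):
    """Reduce the samples to a mean and a worst case, both in seconds."""
    return {
        "mean_s": statistics.mean(recovery_times),
        "max_s": max(recovery_times),
    }


pods = [Pod("ingest-0", 4.0), Pod("ingest-1", 5.0), Pod("analytics-0", 9.0)]
samples = run_experiment(pods, rounds=50)
report = summarize(samples)
print(f"mean recovery: {report['mean_s']:.1f}s, worst case: {report['max_s']:.1f}s")
```

In a real cluster the recovery times would be measured rather than drawn from a random generator; the sketch only shows how repeated terminations turn into a readiness metric.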
History
2003 – Amazon
While working to improve website reliability at Amazon, Jesse Robbins created "GameDay", an initiative that increases reliability by purposefully creating major failures on a regular basis. Robbins has said GameDay was inspired by firefighter training and by research in other fields, including lessons from complex systems and reliability engineering.
2006 – Google
While at Google, Kripa Krishnan created a program similar to Amazon's GameDay called "DiRT".
2011 – Netflix
While overseeing Netflix's migration to the cloud in 2011, Nora Jones, Casey Rosenthal, and Greg Orzell expanded the discipline by setting up a tool that would cause breakdowns in their production environment, the environment used by Netflix customers. The intent was to move from a development model that assumed no breakdowns to a model where breakdowns were considered to be inevitable, driving developers to consider built-in resilience to be an obligation rather than an option:
"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."
By regularly "killing" random instances of a software service, it was possible to test a redundant architecture to verify that a server failure did not noticeably impact customers.
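The idea of killing a random instance and checking that customers are unaffected can be sketched as follows. This is an illustrative toy model, not Netflix's implementation: the `Service` class, the instance names and the `kill_random_instance` helper are hypothetical.

```python
import random


class Service:
    """Hypothetical redundant service made of interchangeable instances."""

    def __init__(self, instance_count):
        self.healthy = {f"instance-{i}" for i in range(instance_count)}

    def handle_request(self):
        """Serve from any healthy instance; fail only if none remain."""
        if not self.healthy:
            raise RuntimeError("no healthy instances left")
        return "ok"


def kill_random_instance(service, rng):
    """Disable one healthy instance chosen at random and return its name."""
    victim = rng.choice(sorted(service.healthy))  # sorted for reproducibility
    service.healthy.remove(victim)
    return victim


rng = random.Random(42)
service = Service(instance_count=3)
victim = kill_random_instance(service, rng)
status = service.handle_request()
print(f"killed {victim}; request status: {status}")
```

With three instances the request still succeeds after one is killed. Running the same experiment against a service with a single instance raises an error, which is the kind of weakness such a test is meant to expose.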
The concept of chaos engineering is close to that of Phoenix Servers, first introduced by Martin Fowler in 2012.
Chaos engineering tools
Chaos Monkey
Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure.
It works by intentionally disabling computers in Netflix's production network to test how remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases.
The code behind Chaos Monkey was released by Netflix in 2012 under an Apache 2.0 license.
The name "Chaos Monkey" is explained in the book ''Chaos Monkeys'' by Antonio Garcia Martinez:
Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand [i.e. flings excrement]. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy.
Simian Army
The Simian Army is a suite of tools developed by Netflix to test the reliability, security, or resiliency of its Amazon Web Services infrastructure and includes the following tools:
At the very top of the Simian Army hierarchy, Chaos Kong drops a full AWS "Region". Though rare, loss of an entire region does happen, and Chaos Kong simulates a system's response and recovery to this type of event.
Chaos Gorilla drops a full Amazon "Availability Zone" (one or more entire data centers serving a geographical region).
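The difference in scope between losing a zone and losing a region can be shown with a small model. This is a hedged sketch rather than the behavior of the actual tools: the `Instance` class, the `surviving` function and the four-instance fleet are hypothetical, and the region and zone names are used only as illustrative labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical record of where one server runs."""
    region: str
    zone: str


def surviving(fleet, *, region=None, zone=None):
    """Return the instances left after dropping a whole region or zone."""
    return [
        inst for inst in fleet
        if inst.region != region and inst.zone != zone
    ]


fleet = [
    Instance("us-east-1", "us-east-1a"),
    Instance("us-east-1", "us-east-1b"),
    Instance("us-west-2", "us-west-2a"),
    Instance("us-west-2", "us-west-2b"),
]

after_zone_loss = surviving(fleet, zone="us-east-1a")      # Chaos Gorilla scope
after_region_loss = surviving(fleet, region="us-east-1")   # Chaos Kong scope
print(len(after_zone_loss), len(after_region_loss))  # prints "3 2"
```

A system passes the zone-level experiment if the three remaining instances can carry the load, and the region-level experiment if the two instances in the other region can.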
Proofdock chaos engineering platform
Proofdock is a chaos engineering platform that focuses on and leverages the Microsoft Azure platform and the Azure DevOps services. Users can inject failures at the infrastructure, platform and application level.
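Failure injection at the application level, in general terms, can be sketched as a wrapper that makes a fraction of calls fail on purpose. The example below is not Proofdock's API: `inject_faults`, `InjectedFault`, `fetch_profile` and the 30% failure probability are all hypothetical names and values chosen for illustration.

```python
import functools
import random


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def inject_faults(probability, rng):
    """Wrap a function so that a fraction of its calls fail on purpose."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


rng = random.Random(7)  # seeded so the experiment is repeatable


@inject_faults(probability=0.3, rng=rng)
def fetch_profile(user_id):
    """Stand-in for a call to a dependency that may fail."""
    return {"id": user_id}


def fetch_with_fallback(user_id):
    """The resilience measure under test: fall back to a default on failure."""
    try:
        return fetch_profile(user_id)
    except InjectedFault:
        return {"id": user_id, "fallback": True}


results = [fetch_with_fallback(i) for i in range(100)]
fallbacks = sum(1 for r in results if r.get("fallback"))
print(f"{fallbacks} of {len(results)} calls used the fallback")
```

Every call returns a usable result even though some of the underlying calls fail, which is the property the experiment is designed to confirm.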
Gremlin
Gremlin is a "failure-as-a-service" platform.
Facebook Storm
To prepare for the loss of a datacenter, Facebook regularly tests the resistance of its infrastructure to extreme events. Known as the Storm Project, the program simulates massive data center failures.
Days of Chaos
Voyages-sncf.com created a "Day of Chaos" in 2017, gamifying the simulation of pre-production failures. They presented their results at the 2017 DevOps REX conference.
See also
* Fault injection
* Fault tolerance
* Fault-tolerant computer system
* Data redundancy
* Error detection and correction
* Fall back and forward
* Grease (networking)
* Resilience (network)
* Robustness (computer science)
External links
* Principles of Chaos Engineering: The Chaos Engineering manifesto
* Chaos Engineering, by Adrian Hornsby
* How Chaos Engineering Practices Will Help You Design Better Software, by Mariano Calandra