Chaos Engineering

Shahzad Masud

Published: January 23, 2023

The discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.

Advances in large-scale, distributed software systems are changing the game for software engineering. As an industry, we are quick to adopt practices that increase flexibility of development and velocity of deployment. But how much confidence can we have in the complex systems that we put into production?

We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must address the most significant weaknesses proactively, before they affect our customers in production. We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.

An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment (aka Chaos Engineering).

IN PRACTICE

To specifically address the uncertainty of distributed systems at scale, Chaos Engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses. These experiments follow four steps:

1. Define ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large.
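The four steps can be sketched as a small simulation. This is only an illustration under stated assumptions: the "system" is a random number generator, the injected fault is modeled as a higher failure rate, and the function names, failure rates, and tolerance are all hypothetical rather than taken from any real chaos tool.

```python
import random


def success_rate(failure_rate: float, requests: int, rng: random.Random) -> float:
    """Simulate `requests` calls and return the fraction that succeed."""
    successes = sum(rng.random() >= failure_rate for _ in range(requests))
    return successes / requests


def run_experiment(
    baseline_failure_rate: float = 0.01,   # assumed normal failure rate
    injected_failure_rate: float = 0.05,   # assumed failure rate under the fault
    requests: int = 10_000,
    tolerance: float = 0.02,               # assumed acceptable deviation
    seed: int = 0,
) -> dict:
    rng = random.Random(seed)

    # Steps 1 and 2: steady state is the success rate, and the hypothesis is
    # that both groups stay within `tolerance` of each other.
    control = success_rate(baseline_failure_rate, requests, rng)

    # Step 3: the experimental group is exposed to the injected fault.
    experimental = success_rate(injected_failure_rate, requests, rng)

    # Step 4: try to disprove the hypothesis by comparing the two groups.
    deviation = abs(control - experimental)
    return {
        "control": control,
        "experimental": experimental,
        "deviation": deviation,
        "hypothesis_holds": deviation <= tolerance,
    }


if __name__ == "__main__":
    print(run_experiment())
```

With the assumed defaults, the experimental group deviates by roughly four percentage points, so the hypothesis is disproved and the simulated weakness becomes a target for improvement.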

ADVANCED PRINCIPLES

The following principles describe an ideal application of Chaos Engineering, applied to the processes of experimentation described above. The degree to which these principles are pursued strongly correlates to the confidence we can have in a distributed system at scale.

Build a Hypothesis around Steady State Behavior

Focus on the measurable output of a system, rather than internal attributes of the system. Measurements of that output over a short period of time constitute a proxy for the system’s steady state. The overall system’s throughput, error rates, latency percentiles, etc. could all be metrics of interest representing steady state behavior. By focusing on systemic behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.
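As a rough sketch of what such a proxy might look like, the snippet below summarizes a short window of request samples into throughput, error rate, and latency percentiles. The `Request` record, the field names, and the nearest-rank percentile method are assumptions made for illustration; a real system would usually read these figures from its monitoring stack.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(requests: list[Request], window_seconds: float) -> dict:
    """Summarize externally observable output over one measurement window."""
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if not r.ok)
    return {
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
    }
```

Nothing here inspects internal state such as queue depths or thread counts; the summary is built only from what a caller of the system could observe.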

Vary Real-world Events

Chaos variables reflect real-world events. Prioritize events either by potential impact or estimated frequency. Consider events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a Chaos experiment.
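One simple way to keep such a backlog of candidate events is a list that can be ranked by either criterion. The sketch below is hypothetical: the `ChaosEvent` fields, the 1 to 5 impact scale, and the per-year frequency estimate are illustrative assumptions, not part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosEvent:
    name: str
    potential_impact: int      # assumed scale: 1 (minor) to 5 (severe)
    estimated_frequency: int   # assumed unit: expected occurrences per year


# Maps the ranking criterion to the field it sorts on.
_SORT_FIELDS = {
    "impact": "potential_impact",
    "frequency": "estimated_frequency",
}


def prioritize(events: list[ChaosEvent], by: str = "impact") -> list[ChaosEvent]:
    """Return events ordered from highest to lowest on the chosen criterion."""
    if by not in _SORT_FIELDS:
        raise ValueError(f"unknown criterion: {by!r}")
    field = _SORT_FIELDS[by]
    return sorted(events, key=lambda e: getattr(e, field), reverse=True)
```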

Run Experiments in Production

Systems behave differently depending on environment and traffic patterns. Since the behavior of utilization can change at any time, sampling real traffic is the only way to reliably capture the request path. To guarantee both authenticity of the way in which the system is exercised and relevance to the current deployed system, Chaos strongly prefers to experiment directly on production traffic.

Automate Experiments to Run Continuously

Running experiments manually is labor-intensive and ultimately unsustainable. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.

Minimize Blast Radius

Experimenting in production has the potential to cause unnecessary customer pain. While there must be an allowance for some short-term negative impact, it is the responsibility and obligation of the Chaos Engineer to ensure the fallout from experiments is minimized and contained.
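Two common containment techniques are exposing only a small, stable slice of traffic to the experiment and aborting automatically when the error rate climbs too high. The sketch below illustrates both under assumptions: the function and class names, the hash-based bucketing, and the thresholds are hypothetical choices, not a prescribed mechanism.

```python
import hashlib


def in_experiment(user_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent` percent of users in the experiment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100


class BlastRadiusGuard:
    """Signals an abort once the observed error rate exceeds a limit."""

    def __init__(self, max_error_rate: float, min_samples: int) -> None:
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.total = 0
        self.errors = 0

    def record(self, ok: bool) -> None:
        self.total += 1
        if not ok:
            self.errors += 1

    def should_abort(self) -> bool:
        # Wait for enough samples so a single early failure does not abort.
        if self.total < self.min_samples:
            return False
        return self.errors / self.total > self.max_error_rate
```

Hashing the user ID keeps each user in the same group for the whole experiment, so any negative impact stays confined to the same small population.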

Chaos Engineering is a powerful practice that is already changing how software is designed and engineered at some of the largest-scale operations in the world. Where other practices address velocity and flexibility, Chaos specifically tackles systemic uncertainty in these distributed systems. The Principles of Chaos provide confidence to innovate quickly at massive scales and give customers the high quality experiences they deserve.
