Chaos engineering

Luciano Baez

SRE & Devops (GCP,AWS, Linux, Ansible, python, etc)

Published: September 29, 2023

Chaos Engineering is a practice in the technology and software development field that aims to test and evaluate the resilience of a computer system in the face of adverse conditions and chaos situations. The central idea is to introduce faults and unusual conditions into a running system in a controlled and planned manner to observe how it reacts and recovers.

The purpose of this practice is to identify weaknesses, vulnerabilities and critical points of failure in a system, as well as to understand how errors propagate and are managed. By subjecting the system to unexpected and chaotic situations, engineers can learn how to improve a system's fault tolerance, availability, redundancy, and resilience.

Some examples of chaos that can be introduced in a controlled way include:

Hardware or software failures: Disable or overload specific components to evaluate resiliency and redundancy.
Network outages: Simulate failures in network connectivity to evaluate how the system behaves under degraded or disrupted network conditions.
Load increase: Suddenly increase the workload to evaluate the scalability and performance of the system.
Configuration alterations: Change configurations unexpectedly to test the system's ability to self-heal and adapt.
External service outages: Simulate failures in external services that the system relies on to evaluate how it handles these situations.
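
As an illustration of the external service outage scenario, here is a minimal Python sketch of application-level fault injection. It is not taken from any particular tool: `FaultInjector`, `fetch_price`, and `fetch_price_with_fallback` are hypothetical names, and the probabilities and delay are assumed values chosen for the example. The wrapper randomly adds latency or raises an error before calling the real function, and an `enabled` flag acts as a kill switch so the experiment stays controlled.

```python
import random
import time


class FaultInjector:
    """Wraps a callable and injects latency or errors with fixed probabilities."""

    def __init__(self, error_rate=0.0, latency_rate=0.0, latency_s=0.0,
                 seed=None, sleep=time.sleep):
        self.error_rate = error_rate      # probability of raising an error
        self.latency_rate = latency_rate  # probability of adding a delay
        self.latency_s = latency_s        # delay length in seconds
        self._rng = random.Random(seed)   # seeded for repeatable experiments
        self._sleep = sleep               # injectable so tests need not wait
        self.enabled = True               # kill switch for the experiment

    def wrap(self, func):
        def wrapper(*args, **kwargs):
            if self.enabled:
                if self._rng.random() < self.latency_rate:
                    self._sleep(self.latency_s)
                if self._rng.random() < self.error_rate:
                    raise ConnectionError("injected fault: simulated outage")
            return func(*args, **kwargs)
        return wrapper


def fetch_price(item):
    """Hypothetical stand-in for a call to an external service."""
    return {"item": item, "price": 10}


def fetch_price_with_fallback(fetch, item):
    """Caller-side handling under test: degrade gracefully on failure."""
    try:
        return fetch(item)
    except ConnectionError:
        return {"item": item, "price": None, "degraded": True}


if __name__ == "__main__":
    injector = FaultInjector(error_rate=0.3, seed=42)
    flaky_fetch = injector.wrap(fetch_price)
    for _ in range(5):
        print(fetch_price_with_fallback(flaky_fetch, "book"))
```

Wrapping the dependency rather than the caller keeps the fault at the boundary where a real outage would appear, so the fallback path is exercised exactly as it would be in production.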

Chaos Engineering is based on the idea that by identifying and correcting weaknesses in a system before they become real problems, the resilience, stability and reliability of the system as a whole can be significantly improved. This practice has become essential in modern environments, especially in distributed and cloud systems, where complexity and interdependence are high.

The implementation of Chaos Engineering by an SRE (Site Reliability Engineer) involves conducting controlled tests and experiments to evaluate the resilience and reliability of a system. SREs focus specifically on ensuring the operability of complex systems, and Chaos Engineering is a key tool in their toolbox. Here I explain how an SRE can implement Chaos Engineering:

Definition of objectives and key metrics: Identify the business and operational objectives that you want to achieve with Chaos Engineering. These objectives must be aligned with the resilience and reliability of the system. Define key metrics that will help evaluate system resilience and performance during experiments.
Identification of chaos scenarios: Collaborate with the team to identify realistic chaos scenarios that the system could face in production. Consider possible component failures, traffic overload, and network loss, among others.
Experiment design: Design specific experiments for each identified chaos scenario. Define how failures will be introduced into the system in a controlled and planned manner.
Implementation of tools and techniques: Use appropriate tools and techniques to introduce chaos in a controlled manner into the production or production-like environment. Tools like "Chaos Monkey", "Gremlin" or custom scripts can be useful for simulating errors.
Running experiments: Carry out Chaos Engineering experiments according to the previously established plan. Carefully monitor and record how the system responds to different types of faults and adverse conditions.
Analysis of results and learning: Analyze the results of experiments to identify patterns, weaknesses, and areas for improvement in the system. Use the information obtained to propose improvements in system architecture, configuration, fault tolerance, and recovery capabilities.
Iteration and continuous improvement: Based on what you have learned, iterate on the experiments and tune the system to improve resilience and reliability. Make incremental improvements and continue experimenting to keep your system resilient to failure and prepared for future challenges.
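
The steps above can be sketched as a small experiment runner in Python. This is a simplified illustration under stated assumptions, not the workflow of any specific tool: `run_experiment`, `ExperimentResult`, and `ToyService` are hypothetical names, the key metric is assumed to be a success ratio between 0 and 1, and the threshold of 0.7 is arbitrary. The runner checks the metric before injecting anything, samples it while the fault is active, and always rolls the fault back.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentResult:
    name: str
    baseline_ok: bool
    during_fault: List[float] = field(default_factory=list)
    hypothesis_held: bool = False


def run_experiment(name, probe, threshold, inject, rollback, samples=5):
    """Run one chaos experiment.

    probe:     callable returning the key metric (assumed: success ratio, 0..1)
    threshold: minimum acceptable metric value (the steady-state hypothesis)
    inject:    callable that introduces the fault
    rollback:  callable that removes the fault; runs once injection is attempted
    """
    result = ExperimentResult(name=name, baseline_ok=probe() >= threshold)
    if not result.baseline_ok:
        # Do not inject faults into a system that is already unhealthy.
        return result
    try:
        inject()
        result.during_fault = [probe() for _ in range(samples)]
    finally:
        rollback()
    result.hypothesis_held = min(result.during_fault) >= threshold
    return result


class ToyService:
    """Hypothetical system under test: requests spread evenly over replicas."""

    def __init__(self, replicas):
        self.total = replicas
        self.healthy = replicas

    def success_ratio(self):
        return self.healthy / self.total

    def kill_one(self):
        self.healthy -= 1

    def restore(self):
        self.healthy = self.total


if __name__ == "__main__":
    service = ToyService(replicas=4)
    outcome = run_experiment("kill-one-replica", service.success_ratio, 0.7,
                             service.kill_one, service.restore, samples=3)
    print(outcome)
```

A result with `hypothesis_held` set to `False` is the useful outcome here: it points at a weakness to analyze and fix before the next iteration.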

Chaos Engineering is a continuous cycle of experimentation, analysis and improvement that helps ensure that systems are robust, reliable and resistant to failures in production.

There are several products and tools designed to help implement Chaos Engineering and conduct controlled experiments on computer systems. These tools allow us to simulate and monitor chaos situations in production environments and evaluate the resilience of systems. Here I mention some of them:

Chaos Monkey (from Netflix): Chaos Monkey is an open source tool developed by Netflix. It introduces random failures into the infrastructure to ensure that systems are designed to survive failures.
Gremlin: Gremlin is a platform that allows operations teams to deploy and automate Chaos Engineering experiments in a controlled manner. It provides several ways to simulate failures and evaluate the resilience of applications and systems.
Chaos Toolkit: Chaos Toolkit is an open source tool that allows you to define, run and automate chaos experiments. It offers a wide range of plugins to simulate failures in systems, applications and services.
Kube-monkey: Kube-monkey is an open source implementation of Netflix's Chaos Monkey for Kubernetes clusters. It randomly deletes pods, which helps test and improve application resiliency in Kubernetes.
Pumba: Pumba is an open source tool used to introduce latency, errors, and packet loss into container environments. It is especially useful for performing chaos experiments in container-based applications.
Toxiproxy: Toxiproxy is another open source tool that allows simulation of unstable networks, such as latency or network errors. It is useful for testing the fault tolerance of distributed applications and systems.
Kaos Toolkit: It is a tool that provides capabilities to run Chaos Engineering experiments and measure their impact on infrastructure and applications.
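
To make the idea behind tools such as Chaos Monkey and Kube-monkey more concrete, the following Python sketch shows random victim selection with safety limits. It does not use or reproduce the code or API of any of the tools listed: `pick_victim` is a hypothetical function, and the working-hours window, the opt-out set, and the rule of never selecting from a group with fewer than two eligible instances are assumptions made for this example.

```python
import random
from datetime import datetime


def pick_victim(instances, now, rng, opted_out=frozenset(),
                start_hour=9, end_hour=17):
    """Return one instance name to terminate, or None if none should be."""
    # Only act on weekdays during working hours, when engineers can respond.
    # weekday(): Monday is 0, Sunday is 6.
    if now.weekday() >= 5 or not (start_hour <= now.hour < end_hour):
        return None
    # Respect instances that have opted out of chaos experiments.
    candidates = sorted(i for i in instances if i not in opted_out)
    # Assumed guard: never take down the last eligible instance of a group.
    if len(candidates) < 2:
        return None
    return rng.choice(candidates)


if __name__ == "__main__":
    fleet = ["web-1", "web-2", "web-3"]  # hypothetical instance names
    victim = pick_victim(fleet, datetime.now(), random.Random())
    print("terminate:", victim)
```

The selection logic is kept separate from the termination call on purpose, so the limits can be tested without touching real infrastructure.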

These tools facilitate the implementation of Chaos Engineering practices, allowing operations and development teams to conduct controlled testing and improve the resilience and reliability of systems in production. It is important to choose the tool that meets the specific needs of your environment and applications.
