What is Chaos Engineering

What is Chaos Engineering

  • Chaos Engineering is a disciplined approach to identifying failures and vulnerabilities before they become outages.
  • It is a method of experimenting on any system or infrastructure that exposes weaknesses before they become real problems, and it yields insights into how to handle adverse conditions in production. It also pushes development teams to understand common failure types and to build graceful failure handling and recovery into their code before it goes live.
  • From a security standpoint, it helps isolate the application from an attacker and prevents the application from entering a bad state.
  • Chaos Engineering injects something harmful in order to build an immune system: experimenting on your system to find and then fix its weaknesses is one of the best ways to increase the uptime, reliability, speed, and security of the software you release.

Why do we need Chaos Engineering?

  • When we consider large-scale distributed systems, there are numerous ways to fail, including misconfiguration, application failure, network failure, infrastructure failure, dependency failure, cloud outages, natural disasters, and so on.
  • One of the characteristics of high-quality software is resiliency. Adverse circumstances are things going wrong in the production environment that might bring the application down or seriously degrade performance; they can also be defects in the application that crash it or cause it to generate errors.
  • In a world of distributed systems, a single minute of downtime can be costly.
  • These could include network segments failing, traffic spikes, race conditions, data centers going down, distributed denial of service (DDoS) attacks, or other unpredictable circumstances that could lead to service outages in production.
  • Chaos Engineering is the practice of finding weaknesses in distributed systems by testing real-world outage scenarios and unlikely failure modes against production systems.
  • By doing chaos experiments, we can generate information about how the system as a whole reacts when individual components fail.

How to do Chaos Engineering?

Chaos Engineering is achieved via chaos experiments; below are some examples of these experiments and of the practices that support them.

  • Conduct a chaos engineering kick-off at an all-hands meeting
  • Hypothesize about steady state by collecting data on the health of the system
  • Test database failover, application restart, and crash-recovery processes
  • Trigger varied real-world events, such as turning off a server to simulate a regional failure
  • Run your experiments as close to the production environment as possible
  • Ramp up your experiments by automating them to run continuously
  • Minimize the effects of your experiments to keep from blowing everything up
  • Learn the process for designing chaos engineering experiments
  • Use the Chaos Maturity Model to map the state of your chaos program, including realistic goals
  • Send email updates and progress reports
  • Run monthly metrics reviews and deliver presentations

Implementing an effective chaos engineering practice

Chaos Engineering is implemented effectively by checking the system’s reliability, stability, and ability to survive unstable and unexpected conditions.

Broadly, we can implement it in three simple steps:

1. High severity incident management - this is the practice of recording, triaging, tracking, and assigning business value to problems that impact critical systems.

  • SEV (severity) lifecycle stages include detection, diagnosis, mitigation, prevention, and closure.
  • Detection - identify critical systems (for example, traffic management, databases, and storage); you must have a very fast mean time to detect (MTTD) for critical systems.
  • The SEV lifecycle can be exercised by running incident simulations.
  • Categorize SEV levels according to your system (a small sketch follows this list).
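
As a small illustration of this bookkeeping, here is a minimal Python sketch; the severity levels, timestamps, and incident records are all hypothetical, and a real practice would pull this data from your incident-management tooling.

```python
# Hypothetical SEV definitions and an MTTD calculation; adapt the levels
# and the data source to your own incident-management system.
from datetime import datetime, timedelta
from enum import IntEnum


class Sev(IntEnum):
    SEV1 = 1  # critical system down, customer-facing outage
    SEV2 = 2  # major degradation, workaround exists
    SEV3 = 3  # minor impact, no immediate customer effect


def mean_time_to_detect(incidents):
    """MTTD = average of (detected_at - started_at) across incidents."""
    deltas = [i["detected_at"] - i["started_at"] for i in incidents]
    return sum(deltas, timedelta()) / len(deltas)


incidents = [
    {"sev": Sev.SEV1,
     "started_at": datetime(2021, 1, 5, 10, 0),
     "detected_at": datetime(2021, 1, 5, 10, 4)},
    {"sev": Sev.SEV2,
     "started_at": datetime(2021, 1, 9, 14, 0),
     "detected_at": datetime(2021, 1, 9, 14, 12)},
]

print(mean_time_to_detect(incidents))  # 0:08:00
```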

2. Monitoring - implement effective aggregated metrics and log collection, and create corresponding dashboards for critical services.
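
A minimal sketch of exposing such metrics, assuming Prometheus as the metrics backend and using the Python prometheus_client library; the metric names and the simulated values are made up for illustration.

```python
# Minimal metrics exposition sketch (prometheus_client). In a real service
# the values come from instrumentation, not random numbers.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUEST_ERRORS = Counter("orders_request_errors_total",
                         "Errors returned by the orders service")
DB_REPLICA_LAG = Gauge("orders_db_replica_lag_seconds",
                       "Replication lag of the orders database")

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        if random.random() < 0.05:
            REQUEST_ERRORS.inc()
        DB_REPLICA_LAG.set(random.uniform(0.0, 2.0))
        time.sleep(1)
```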

3. Measure the impact of downtime - understand how SEVs affect your customers and your business.

  • Impact includes system properties such as availability and durability, and business impact such as damaging outcomes, the cost of downtime, broken service-level agreements, and lost customers.
  • When British Airways was down for roughly ten hours in 2017, the cost was estimated at around £80 million in lost revenue.
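
As a back-of-the-envelope illustration of this kind of downtime arithmetic, here is a tiny sketch; every figure in it is invented.

```python
# Back-of-the-envelope cost of a single outage; every figure is illustrative.
revenue_per_minute = 5_000      # average revenue the service earns per minute
sla_credits = 20_000            # contractual credits owed after the breach
engineers = 8                   # people pulled into the incident response
hourly_rate = 90                # loaded cost per engineer-hour
outage_minutes = 45

direct_loss = revenue_per_minute * outage_minutes                # 225,000
response_cost = engineers * hourly_rate * (outage_minutes / 60)  # 540
total = direct_loss + response_cost + sla_credits                # 245,540
print(f"Estimated cost of the outage: ${total:,.0f}")
```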

The 6 B's of implementing Chaos Engineering:

1. Build - Build a new system / improve an existing one

2. Borrow - Use open source / contribute to open source

3. Buy - Use 3rd party systems

4. Brush up - Game Days / Team training

5. Break - Chaos Engineering / Failure injection

6. Begone - Decommission systems / delete code

Chaos Engineering Principles

1. Set up infrastructure and services in such a way that there is "no single point of failure"

2. Design systems with failure in mind from the beginning

3. Replicate data at multiple locations, and synchronize, switch, and interconnect them as fast as possible.

4. Make operations idempotent wherever possible. For example, microservices with circuit breakers and bulkheads scale better, are more resilient and reliable, and limit the blast radius of application failures (a minimal circuit-breaker sketch follows this list).

5. Run experiments that fail your data center fast, practice the failover switch that brings the system back up, and gradually make it more resilient to externally driven collapses.

6. Be more permissive toward failures and more tolerant of risk

7. Have an antifragile feedback loop in order to have an antifragile system

8. Break it more to make the system safer and more restorable / restartable

9. Understand the cost of downtime

10. Situational Awareness and Attack-Driven Defense
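
To make principle 4 concrete, here is a minimal, illustrative circuit-breaker sketch in Python. The thresholds and names are made up, and a production implementation would also need a proper half-open state, per-dependency configuration, and metrics.

```python
# Minimal circuit-breaker sketch. Thresholds are illustrative only.
import time


class CircuitOpenError(RuntimeError):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: allow a trial call (simplified half-open state).
            self.opened_at = None
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrapping calls to a flaky downstream dependency in `breaker.call(...)` means that once the failure threshold is reached, callers fail fast instead of piling up blocked requests.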

Chaos Experiment

  • Chaos engineering performs broad, careful experiments that introduce unpredictable conditions and generate new knowledge about the system’s behaviors, properties, and performance.
  • Many companies with huge customer bases are dedicated to offering a seamless experience to their users. To ensure consistent performance and constant availability, telecom, healthcare, education, and finance organizations are implementing chaos experiments.
  • Chaos experiments in distributed systems require two groups to control and monitor the activities - an experimental group, into which failures are injected, and a control group, which is left untouched and serves as the baseline for comparison.

Steps for a chaos experiment.



  • Define Steady State - define the measurable output that characterizes the system’s normal behavior.
  • Create a control group and an experimental group.
  • Introduce real-world disruptions, such as terminating servers.
  • Pick a Hypothesis - Chaos engineers hypothesize an expected outcome when something goes wrong.
  • Design and run Experiments - Design experiments with variables to reflect real-world events like dependency failure, server failure, network or memory malfunction, and so on.
  • Validate Hypothesis - look for a difference or weakness between the control group and the experimental group, i.e. measure the impact of the test and observe how the steady state differs between the two groups.
  • Then prioritize those events by the frequency and/or magnitude of the failure occurring. Don’t just consider faults in your code; also consider hardware failures, dying servers, and security issues.
  • One final rule of chaos engineering is that you have to run these experiments in production to truly know how the system will react, because behavior depends on your environment and traffic patterns, which you can’t mimic as well in a staging area.
  • If the engineering team finds weaknesses in the system, the chaos experiment was successful; otherwise, the team expands its hypothetical boundaries.
  • Fix Issues - When weaknesses are found, the team addresses and fixes those issues before they become system-wide troubles.
  • Note: because chaos experiments run in, or close to, the production environment, there is a chance that the customer experience will be affected. So it is always wise to plan the smallest possible experiments and be ready to carefully handle the impact (a minimal skeleton of such an experiment follows).
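
A minimal skeleton of the steps above, sketched in Python. The probe, injection, and rollback functions are placeholders (here simulated so the script runs on its own); in practice they would call your monitoring system and your failure-injection tooling, and the latency tolerance is a made-up example.

```python
# Skeleton of a chaos experiment: measure steady state, inject a failure,
# measure again, and validate the hypothesis. All values are simulated.
import random
import statistics


def steady_state_latency_ms():
    """Probe: in a real experiment, query monitoring for e.g. p95 latency."""
    return random.gauss(180, 20)  # simulated so the skeleton is runnable


def inject_failure():
    """Action: e.g. stop one instance, add network latency, kill a pod."""
    print("injecting failure (placeholder)")


def rollback():
    """Undo the injected failure and restore the system."""
    print("rolling back (placeholder)")


def run_experiment(samples=10, tolerance_ms=250.0):
    baseline = statistics.mean(steady_state_latency_ms() for _ in range(samples))
    inject_failure()
    try:
        during = statistics.mean(steady_state_latency_ms() for _ in range(samples))
    finally:
        rollback()
    hypothesis_holds = during <= max(baseline * 1.2, tolerance_ms)
    print(f"baseline={baseline:.0f}ms during={during:.0f}ms "
          f"hypothesis {'held' if hypothesis_holds else 'refuted'}")
    return hypothesis_holds


if __name__ == "__main__":
    run_experiment()
```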

Tools for Chaos Engineering

1. Netflix Toolkit for Chaos Engineering

  • The Netflix Engineering Tools team came up with an innovative idea to test the fault tolerance of the system without any impact on customer service.
  • They created the Chaos Monkey tool, inspired by the idea of a monkey that enters a farm and randomly destroys things.
  • Chaos Monkey is a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact.
  • The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link. - Netflix Tech Blog
  • After the success of Chaos Monkey, the Netflix team created a suite of tools that support chaos engineering principles, named the Simian Army, to check the reliability and resiliency of its AWS infrastructure.
  • List Of Tools Developed By Netflix:
  • Chaos Monkey
  • Latency Monkey
  • Doctor Monkey
  • Conformity Monkey
  • Janitor Monkey
  • Security Monkey
  • Chaos Gorilla
  • Chaos Kong
  • 10–18 Monkey
  • Chaos Monkey, Chaos Gorilla, and Chaos Kong check that the system is set up and designed correctly to handle failures by randomly injecting failures into the production runtime
  • The other monkeys are rule-driven compliance services that automatically monitor the runtime environment to detect changes and to ensure that configurations match predefined definitions. They look for violations of security policies and common security configuration weaknesses (in the case of Security Monkey) or configurations that do not meet predefined standards (Conformity Monkey). They run periodically online, notifying the owner(s) of the service and infosec when something looks wrong. The people responsible for the service need to investigate and correct the problem, or justify the situation.
  • Security Monkey captures details about changes to policies over time. It also can be used as an analysis and reporting tool and for forensics purposes, letting you search for changes across time periods and across accounts, regions, services, and configuration items. It highlights risks like changes to access control policies or firewall rules.
  • These are all chaos tools that are constantly testing the system against all kinds of failures, building a higher level of confidence into the system’s ability to survive.
  • In the real world, Chaos Monkey tests and (in some cases) wreaks havoc on production applications. These tools introduce network delays, cause instances or even entire data center segments to go offline, or identify security vulnerabilities. They also can perform health checks on an application and clean up unused system resources.
  • For those who work specifically with applications, the Apache-licensed Chaos Toolkit simplifies access to chaos engineering concepts. It provides an API that enables experimentation at different levels: infrastructure, platform, and also the application itself.
  • Similar tools include Amazon Inspector, a service that provides automated security assessments of applications deployed on AWS, scans for vulnerabilities, and checks for deviations from best practices, including rules for PCI DSS and other compliance standards. It provides a prioritized list of security issues along with recommendations on how to fix them.
  • You can write your own runtime asserts, for example (a Python sketch appears at the end of this section):
  1. Check that firewall rules are set up correctly
  2. Verify files and directory permissions
  3. Check sudo rules
  4. Confirm SSL configurations
  5. Ensure that logging and monitoring services are working correctly
  6. Run your security smoke test every time the system is deployed, in test and in production.
  • Tools like Puppet and Chef will automatically and continuously scan infrastructure to detect variances from the expected baseline state and alert or automatically revert them.
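
As an illustration of such home-grown runtime asserts, here is a small Python sketch. The file path, hostname, and port are placeholders, and the checks are deliberately simple; a real security smoke test would cover firewall rules, sudo rules, logging, and more.

```python
# Sketch of home-grown runtime asserts. Paths, hosts, and ports are examples.
import os
import socket
import ssl
import stat
import sys

CHECKS = []


def check(fn):
    CHECKS.append(fn)
    return fn


@check
def config_not_world_writable():
    """Fail if a sensitive file is writable by group or others."""
    mode = os.stat("/etc/myapp/app.conf").st_mode  # hypothetical path
    assert not (mode & (stat.S_IWGRP | stat.S_IWOTH)), "config is group/world writable"


@check
def tls_certificate_valid():
    """Fail if the public endpoint does not present a verifiable certificate."""
    ctx = ssl.create_default_context()
    with socket.create_connection(("example.com", 443), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname="example.com"):
            pass  # a successful handshake is the assertion


@check
def debug_port_closed():
    """Fail if an internal-only port is reachable from this host."""
    with socket.socket() as s:
        s.settimeout(2)
        assert s.connect_ex(("example.com", 8081)) != 0, "debug port 8081 is open"


if __name__ == "__main__":
    failures = 0
    for fn in CHECKS:
        try:
            fn()
            print(f"PASS {fn.__name__}")
        except Exception as exc:
            failures += 1
            print(f"FAIL {fn.__name__}: {exc}")
    sys.exit(1 if failures else 0)
```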

2. Pumba is a chaos testing tool for Docker. 

3. Proofdock (https://proofdock.io/) - a chaos platform for Azure. Proofdock’s Chaos Platform supports you in writing, running, storing, and analyzing chaos experiments to ensure that your application works reliably.

4. Cthulhu is a chaos engineering tool that allows DevOps teams to design resilient, self-healing services across hybrid and multi-cloud infrastructures.

  • Cthulhu enables automated cross-platform failure orchestration, using a data-driven approach to simulate complex disaster scenarios. This allows organizations to design more robust systems that better anticipate failure and — more importantly — improve self-healing mechanisms to accelerate automatic recovery.
  • Core features of Cthulhu include:
  1. Cross-platform failure orchestration to automatically run random failure scenarios in any environment and on a schedule.
  2. Version-controllable scenarios so that once a vulnerability is identified, engineers can easily reproduce it in different environments.
  3. Automated communications to allow select team members to monitor the evolution of failure experiments and insights gained through targeted notifications.
  • Cthulhu is available now in xMatters’ GitHub: https://github.com/xmatters/cthulhu-chaos-testing

Chaos Engineering examples

  1. Chaos engineering on databases - for example, suppose you have a master with two replicas underneath it. If you shut down one of the databases, a new clone is created and pops in as a replica, so there is again a master with two replicas. If you shut down the master, one of the replicas is promoted to be the new master.
  2. Chaos experiments for Docker - a best practice for chaos testing is to define repeatable time intervals and duration parameters to better control the chaos. With this, you can deploy Pumba on a single Docker host, a Swarm cluster, or a Kubernetes cluster. It can do things like the following (a Python sketch of similar actions via the Docker SDK follows this list):
  • Stop running Docker containers.
  • Kill containers by sending a termination signal.
  • Remove containers.
  • Stop a random container once every ten minutes.
  • Kill a MySQL container every 15 minutes.
  • Kill random containers every 5 minutes.
  • Pause the queue for 15 seconds every 3 minutes.
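
Pumba itself is a standalone CLI, so rather than reproduce its exact flags here, the sketch below performs similar actions through the Docker SDK for Python (docker-py); the container name prefix and the intervals are examples only.

```python
# Pumba-style container chaos via the Docker SDK (docker-py), for illustration.
import random
import time

import docker

client = docker.from_env()


def kill_random_container(name_prefix="myapp"):
    """Send SIGKILL to one randomly chosen container matching the prefix."""
    candidates = [c for c in client.containers.list()
                  if c.name.startswith(name_prefix)]
    if candidates:
        victim = random.choice(candidates)
        print(f"killing {victim.name}")
        victim.kill(signal="SIGKILL")


def pause_container(name, seconds=15):
    """Freeze a container for a while, then resume it."""
    container = client.containers.get(name)
    container.pause()
    time.sleep(seconds)
    container.unpause()


if __name__ == "__main__":
    while True:
        kill_random_container()
        time.sleep(300)  # roughly "kill a random container every 5 minutes"
```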

3. Chaos Engineering in a Spring Boot Microservice

  • One of the tools that can help you implement chaos engineering in a Spring Boot application is Chaos Monkey for Spring Boot.
  • Chaos Monkey, a tool created by Netflix, aims to help applications tolerate random instance failures; Chaos Monkey for Spring Boot brings some of those chaos engineering principles into Spring Boot applications.

Functionalities

  • Latency assault: adds random latency to REST endpoints
  • Exception assault: throws random runtime exceptions
  • AppKiller assault: kills the app
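
A typical way to enable these assaults is through configuration properties after adding the codecentric chaos-monkey-spring-boot dependency. The property names below follow that project's documentation as commonly shown in its examples, so verify them against the version you actually use.

```properties
# application.properties - enabling Chaos Monkey for Spring Boot
# (verify property names against the chaos-monkey-spring-boot version in use)
spring.profiles.active=chaos-monkey
chaos.monkey.enabled=true

# Watch REST controllers so assaults can be injected into them
chaos.monkey.watcher.rest-controller=true

# Latency assault: add 1-3 seconds of random latency
chaos.monkey.assaults.latency-active=true
chaos.monkey.assaults.latency-range-start=1000
chaos.monkey.assaults.latency-range-end=3000

# Exception assault: throw random runtime exceptions
chaos.monkey.assaults.exceptions-active=true

# AppKiller assault: kill the application (use with care)
chaos.monkey.assaults.kill-application-active=true
```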

Chaos Engineering with DevOps

  • Integrating chaos engineering into the DevOps toolchain contributes to the goal of continuous testing. Many companies such as Netflix, Amazon, Google, Microsoft, Twilio, LinkedIn, Dropbox, Uber, and Slack follow this discipline.
  • Chaos principles are the best approach to testing a system’s ability to withstand failures when it comes to DevOps-driven software development.
  • Chaos Engineering tends to be used primarily in DevOps during continuous testing phase: setting up experiments to run software under different conditions, such as peak traffic, and monitoring how it functions and performs. This becomes increasingly necessary in cloud-based systems where failure to understand extreme load responses could result in runaway cascade failures or, worse yet, spinning up thousands of extra nodes handling error conditions while not doing any actual work.
  • Similarly, it can be used during the infrastructure testing stage of a DevOps pipeline, against the freshly provisioned infrastructure.
  • Chaos Engineering helps in defining a functional baseline and tolerances for infrastructure, policies, and processes by clarifying both steady-state and chaotic outputs when extremes are reached.
  • Chaos Engineering monitors system dynamics, and when a problem occurs, engineering change management is brought in to remediate the issue
  • With the growth of containerization in cloud applications today, IT infrastructure looks more like development environments than classical multi-tier architectures. But the limitless scale of the cloud means failures can also be limitless: microservices are well-served by testing elasticity and scalability, data flows, and resiliency through stressing the system to the edge of its tolerances and fixing their shortcomings before a public crash.
  • DevOps values production feedback and emphasizes the importance of measuring and monitoring production activity. You can extend the same approaches—and the same tools —to security monitoring, involving the entire team instead of just the SOC, making security metrics available in the context of the running system, and graphing and visualizing security-related data to identify trends and anomalies.
  • Recognize that your system is, or will be, under constant attack. Take advantage of the information that this gives you. Use this information to identify and understand attacks and the threat profile of the system.
  • Attacks take time. Move to the left of the kill chain and catch them in the early stages. You will reduce the Mean Time to Detect (MTTD) attacks by taking advantage of the close attention that DevOps teams pay to feedback from production, and adding security data into these feedback loops. You also will benefit by engaging people who are closer to the system: the people who wrote the code and keep the system running, who understand how it is supposed to work, what normal looks like, and when things aren’t normal.
  • Feed this data back into your testing and your reviews, prioritizing your actions based on what you are seeing in production, in the same way that you would treat feedback from Continuous Integration or A/B testing in production. This is real feedback, not theoretical, so it should be acted on immediately and seriously.
  • Information on security events helps you to understand and prioritize threats based on what’s happening now in production. Watching for runtime errors and exceptions and attack signatures shows where you are being probed and tested, what kind of attacks you are seeing, where they are attacking, where they are being successful, and what parts of the code need to be protected.
  • This should help drive your security priorities and tell you where you should focus your testing and remediation. Vulnerabilities that are never attacked (probably) won’t hurt you. But attacks that are happening right now need to be resolved right away.
  • If you can’t successfully shift security left, earlier into design and coding and Continuous Integration and Continuous Delivery, you’ll need to add more protection at the end, after the system is in production. Network IDS/IPS solutions tools like Tripwire or signature-based WAFs aren’t designed to keep up with rapid system and technology changes in DevOps. This is especially true for cloud IaaS and PaaS environments, for which there is no clear network perimeter and you might be managing hundreds or thousands of ephemeral instances across different environments (public, private, and hybrid), with self-service Continuous Deployment.
  • A number of cloud security protection solutions are available, offering attack analysis, centralized account management and policy enforcement, file integrity monitoring and intrusion detection, vulnerability scanning, micro-segmentation, and integration with configuration management tools like Chef and Puppet. Some of these solutions include the following:
  • Alert Logic
  • CloudPassage Halo
  • Dome9 SecOps
  • Evident.io
  • Illumio
  • Palerra LORIC
  • Threat Stack
  • Another kind of runtime defense technology is Runtime Application Security Protection/Self-Protection (RASP), which uses run-time instrumentation to catch security problems as they occur. Like application firewalls, RASP can automatically identify and block attacks. And like application firewalls, you can extend RASP to legacy apps for which you don’t have source code.
  • But unlike firewalls, RASP is not a perimeter-based defense. RASP instruments the application runtime code and can identify and block attacks at the point of execution. Instead of creating an abstract model of the code (like static analysis tools), RASP tools have visibility into the code and runtime context, and use taint analysis and data flow and control flow and lexical analysis techniques, directly examining data variables and statements to detect attacks. This means that RASP tools have a much lower false positive (and false negative) rate than firewalls.
  • You also can use RASP tools to inject logging and auditing into legacy code to provide insight into the running application and attacks against it. They trade off runtime overheads and runtime costs against the costs of making coding changes and fixes upfront.
  • There are only a small number of RASP solutions available today, mostly limited to applications that run in the Java JVM and .NET CLR, although support for other languages like Node.js, Python, and Ruby is emerging. These tools include the following:
  • Immunio
  • Waratek
  • Prevoty
  • Contrast Security
  • tCell is a cloud-based SaaS solution that instruments the system at runtime and injects checks and sensors into control points in the running application: database interfaces, authentication controllers, and so on.
  • It uses this information to map out the attack surface of the system and identifies when the attack surface is changed. tCell also identifies and can block runtime attacks based on the following:
  • Known bad patterns of behavior (for example, SQL injection attempts)—like a WAF.
  • Threat intelligence and correlation—black-listed IPs, and so on.
  • Behavioral learning—recognizing anomalies in behavior and traffic. Over time, it identifies what is normal and can enforce normal patterns of activity, by blocking or alerting on exceptions.
  • tCell works in Java, Node.js, Ruby on Rails, and Python (.NET and PHP are in development).
  • Twistlock - Twistlock provides runtime defense capabilities for Docker containers in enterprise environments. Twistlock’s protection includes enterprise authentication and authorization capabilities
  • Twistlock scans containers for known vulnerabilities in dependencies and configuration (including scanning against the Docker CIS benchmark). It also scans to understand the purpose of each container. It identifies the stack and the behavioral profile of the container and how it is supposed to act, creating a white list of expected and allowed behaviors.
  • An agent installed in the runtime environment (also as a container) runs on each node, talking to all of the containers on the node and to the OS. This agent provides visibility into runtime activity of all the containers, enforces authentication and authorization rules, and applies the white list of expected behaviors for each container as well as a black list of known bad behaviors (like a malware solution). And because containers are intended to be immutable, Twistlock recognizes and can block attempts to change container configurations at runtime.
  • Gremlin - Gremlin Inc. - Failure as a Service.
  • Gremlin provides an enterprise service to safely, securely, and simply run Chaos Experiments that prevent outages and reduce downtime. Rather than guess how your systems and employees will react to common failure scenarios, harden them with Chaos Engineering. Gremlin offers a comprehensive and easy-to-use enterprise service to safely and securely use Chaos Engineering to build and operate reliable applications.

References

https://github.com/dastergon/awesome-chaos-engineering



