Chaos Engineering - A perspective

Chaos Engineering and Organisational Resilience


From Chaos comes Clarity. In the midst of Chaos there are opportunities.

Sun Tzu, Author of The Art of War

A transnational, multi-billion-dollar corporation operating in today’s digital age collects, stores and analyses critical customer data across multiple geographies. Its IT infrastructure must therefore be resilient enough to handle varied data loads, avoid system failures and counter cyber threats to that data, even as the demands of continuous deployment increase in frequency.

Many organisations have adopted cloud native stacks and the DevOps model to test the reliability of their systems. However, as system glitches become normal operating conditions of increasingly complex systems, the need of the hour is a more proactive, feedback-loop-based approach that can expose the unknowns. According to Gartner, when IT infrastructure, networks or applications unexpectedly fail or crash, the damage ranges from a low of $140,000 to a high of $540,000 per hour.

Chaos engineering is a diagnostic practice that, when applied to an organisation’s infrastructure, delivers resilience and reliability. It has the potential to reveal valuable, objective information about a system’s vulnerabilities, allowing organisations to invest in system-correction programmes more efficiently.

Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.

The combination of cloud computing, microservices architectures and bare-metal infrastructure makes for increasingly complex systems, leaves them vulnerable to many potential points of failure and renders them anything but predictable. In 2014, a leading bank used this methodology to dramatically reduce incident counts. Since then, the tooling around the methodology has progressed considerably, and today more and more industry sectors, including healthcare, finance and virtual gaming, are using chaos engineering to their advantage.

Some of the advantages of chaos engineering that we will discuss further in this paper include:

1. Helping organisations predict inflection points

2. Reducing incident counts

3. Defining disaster recovery response (DRR)

4. Achieving cost savings by avoiding system outages

Key benefits of chaos engineering:

Software services are left to the mercy of the environment in which they run, which is filled with unknowns. We have evolved into a digital age that places software systems at the heart of everything, and to succeed in such an age, IT systems and architectures have to be resilient. Chaos engineering is a new approach to software development and testing designed to eliminate unpredictability by putting complexity and interdependence to the test. It is not a replacement for automated, functional or integration testing; instead, it helps find previously undiscovered weaknesses that are not typical functional bugs.

1. Helps organisations predict inflection points based on accurate definitions of steady state.

Incident management is the practice of recording, identifying, tracking, and assigning business value to problems that impact critical systems. The purpose of predicting inflection points from an accurate definition of steady state is to enhance the customer experience by improving infrastructure reliability.

An incident management system based on accurate definitions of steady state encompasses incident identification, analysis, management, prevention, and resolution. Incident prevention includes incident review and incident correlation.

The idea is to conduct controlled experiments in a distributed environment that will help organisations build confidence in their systems’ ability to tolerate inevitable failures. These experiments follow 4 steps:

  • Define ‘steady state’ as some measurable output of a system that indicates normal behaviour
  • Hypothesise that this steady state will continue in both the control group and the experimental group
  • Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
  • Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group

The harder it is to disrupt the steady state, the more confidence we have in the behaviour of the system. In addition, chaos engineering services also provide the ability to revert systems to their original states without impacting users.
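
To make these four steps concrete, here is a minimal, illustrative sketch in Python. The simulated service, its failure rates and the 99% success-rate threshold are assumptions chosen for illustration, not part of any specific chaos engineering tool; in a real experiment the calls would go to a production system split into control and experimental groups.

```python
"""Minimal sketch of the four-step chaos experiment described above.

The simulated service, its failure rates and the 99% threshold are
illustrative assumptions, not any particular tool's behaviour.
"""
import random
import statistics

def call_service(fault_injected: bool) -> bool:
    """Stand-in for a real request to the system under test.

    Simulates a service that succeeds ~99.9% of the time normally and
    degrades slightly when a dependency is disturbed.
    """
    failure_rate = 0.02 if fault_injected else 0.001
    return random.random() > failure_rate

def measure_steady_state(fault_injected: bool, samples: int = 1000) -> float:
    """Step 1: steady state defined as the observed request success rate."""
    results = [call_service(fault_injected) for _ in range(samples)]
    return statistics.mean(results)

def run_experiment(threshold: float = 0.99) -> None:
    # Step 2: hypothesise the steady state holds for both groups.
    control = measure_steady_state(fault_injected=False)
    # Step 3: introduce a real-world variable (here, a simulated dependency fault).
    experimental = measure_steady_state(fault_injected=True)
    # Step 4: try to disprove the hypothesis by comparing the two groups.
    print(f"control={control:.3f} experimental={experimental:.3f}")
    if experimental < threshold:
        print("Hypothesis disproved: steady state degraded under the injected fault.")
    else:
        print("Hypothesis holds: steady state maintained despite the fault.")

if __name__ == "__main__":
    run_experiment()
```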

This steady-state-based approach to predicting inflection points is an important subset of reliability engineering, focused on ensuring that a team is prepared to manage incidents. It can greatly improve customer experience and empower companies to meet SLAs, and it leaves organisations better prepared for compliance and auditing events as they arise.

Case Study:

In 2018, DBS Bank embarked on this program of reliability engineering inspired by Netflix’s Chaos Monkey testing.

Objective:

Uncover weaknesses in architecture design and code.

Outcome:

  • Creation of a tool internally called Wreckoon.
  • Allowed developers to identify application behaviour under turbulent scenarios early during the development and testing cycle, alongside planning for the right fixes ahead of time.
  • Delivery of a “healthy end-user experience”, alongside the identification of dependencies on application critical paths, improving mean time to detect and also reducing mean time to repair.

2. Reduces incident counts and builds system resilience.

Chaos engineering is recognised as an engineering discipline in which carefully chosen, isolated and controlled system failures are used as a method to identify vulnerabilities within the system, allowing organisations to achieve increasing resilience.

The experiment introduces behaviours into the system that are both likely to occur at random and likely to impact core business functionality. For instance, network dependencies are a way of life in distributed systems, and as more distributed systems are joined by networks, the result is ever-increasing complexity. Chaos engineering is an optimal way to test for potential failures within network dependencies on the path to increasing resilience.

Applications have both internal and external network dependencies: internal dependencies are systems under our control, while external dependencies are not. It is critical to have an airtight plan for when an incident is detected within a network dependency. By combining this plan with chaos engineering, organisations can deliberately make an unreliable network dependency unreachable from their applications. Once the fault has been applied, they should check that the application still starts up normally and is able to serve customer traffic without the dependency, to confirm the effectiveness of their solution.
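
As an illustration of such a network-dependency experiment, the following sketch temporarily blackholes traffic to one dependency and then checks an application health endpoint. The dependency hostname, the health URL and the use of iptables (which assumes a Linux host with root access) are hypothetical choices for the sketch, not a prescribed tooling setup.

```python
"""Sketch of a network-dependency experiment: make one dependency unreachable
and verify the application still serves traffic.

Assumptions: Linux host with iptables and root access; the dependency host
and the health URL below are hypothetical placeholders.
"""
import socket
import subprocess
import urllib.request
from contextlib import contextmanager

DEPENDENCY_HOST = "payments.internal.example"    # hypothetical dependency
APP_HEALTH_URL = "http://localhost:8080/health"  # hypothetical health endpoint

@contextmanager
def dependency_blackholed(host: str):
    """Temporarily drop all outbound traffic to the dependency's IP."""
    ip = socket.gethostbyname(host)
    rule = ["iptables", "-A", "OUTPUT", "-d", ip, "-j", "DROP"]
    subprocess.run(rule, check=True)
    try:
        yield ip
    finally:
        # Always revert the system to its original state.
        subprocess.run(["iptables", "-D", "OUTPUT", "-d", ip, "-j", "DROP"], check=True)

def app_is_healthy() -> bool:
    """Probe the application's health endpoint."""
    try:
        with urllib.request.urlopen(APP_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    with dependency_blackholed(DEPENDENCY_HOST):
        print("Dependency blocked; app healthy:", app_is_healthy())
    print("Rule removed; app healthy:", app_is_healthy())
```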

Case Study 1

Background: In 2014, National Australia Bank (NAB) moved from its own physical infrastructure to Amazon Web Services, with the objective of reducing incident counts.

Solution: NAB deployed the Chaos Monkey tool to respond to server emergencies outside of working hours on a 24x7 basis. The application was developed by Netflix to constantly test the resiliency of its Amazon-based infrastructure, randomly killing servers within its architecture to make sure it can compensate for the failure.

Result: The new tools have allowed NAB to remove the monitoring thresholds that would flash orange when servers began to struggle, and cause phones to start ringing at all hours of the day.

3. Defines Disaster Recovery Response (DRR)

A disaster recovery response plan aims to set up safeguards for storage, backup and availability. Chaos engineering, by its very nature, caters to DRR through the proactive methods involved in its experiments.

All applications have storage in one form or another, and managing the relationship between application and storage is critical to overall system health. There are a number of ways that an application may stress its storage (badly defined queries, missing indices, poor sharding, upstream caching decisions, etc.), but all of them result in an unresponsive data layer.

Hence, it is critical to understand how storage saturation impacts applications. There are a few ways of modelling this with a chaos experiment. One is to make the storage unavailable and add latency to requests already queued to it, thereby making it either unresponsive or slow. In addition, organisations can consume input/output bandwidth to simulate a congested path to the storage. As storage is a critical dependency, such an experiment will make some features of the application either slow or unavailable, and the impact should be limited to the set of features backed by that storage. This is also a great opportunity to align timeouts to the storage and test that they cut off requests as expected.
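
A minimal sketch of the timeout-alignment part of this experiment appears below. The 200 ms storage timeout, the injected one-second latency and the stand-in query function are assumptions for illustration; in practice the latency would be injected at the network or storage layer rather than inside application code.

```python
"""Sketch of the storage-saturation experiment: inject latency into storage
calls and verify that application timeouts cut requests off as expected.
The timeout, the injected latency and the fake query are assumptions.
"""
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

STORAGE_TIMEOUT_S = 0.2   # assumed application-side timeout budget for storage calls
INJECTED_LATENCY_S = 1.0  # simulated saturation: every call becomes slow

def storage_query(injected_latency_s: float = 0.0) -> str:
    """Stand-in for a real storage/database call."""
    time.sleep(0.01 + injected_latency_s)
    return "row"

def query_with_timeout(injected_latency_s: float) -> str:
    """Run the storage call under the application's timeout budget."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(storage_query, injected_latency_s)
        return future.result(timeout=STORAGE_TIMEOUT_S)

if __name__ == "__main__":
    print("normal:", query_with_timeout(0.0))
    try:
        query_with_timeout(INJECTED_LATENCY_S)
    except FutureTimeout:
        # The timeout fired as expected: only the storage-backed feature
        # degrades, rather than the whole application hanging.
        print("saturated: request cut off by the storage timeout, as designed")
```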

The Domain Name System (DNS) converts human-readable addresses into machine-readable IP addresses. The critical role that DNS plays in keeping systems running is widely unrecognised, as seen when many companies experienced customer-facing issues during a real-world DNS failure in October 2016. Failures like this are relatively rare, and getting a recovery plan together is challenging.

Chaos engineering can help induce a DNS outage in order to understand how applications respond. The fixes will vary depending on the issue, but common solutions are to pass around IP addresses instead of hostnames for internal addressing and to use a backup DNS provider. Running this experiment enables organisations to consider what sort of damage an outage of this service would cause.
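
The sketch below illustrates one way to model such a DNS outage inside a single process: resolution is patched to fail, and a hypothetical static IP cache stands in for the "pass around IP addresses" mitigation. The hostnames, the cache contents and the patching approach are illustrative assumptions, not a specific tool's behaviour.

```python
"""Sketch of a DNS-outage experiment: simulate resolution failure in-process
and check that a fallback path keeps the application reachable.
Hostnames and IPs are hypothetical placeholders.
"""
import socket
from unittest import mock

# Hypothetical fallback map an application might keep for internal addressing.
STATIC_IP_CACHE = {"api.internal.example": "10.0.0.12"}

def resolve(host: str) -> str:
    """Resolve via DNS, falling back to the static cache when DNS is down."""
    try:
        return socket.gethostbyname(host)
    except socket.gaierror:
        if host in STATIC_IP_CACHE:
            return STATIC_IP_CACHE[host]
        raise

def dns_outage():
    """Patch resolution to fail, emulating an upstream DNS provider outage."""
    return mock.patch("socket.gethostbyname",
                      side_effect=socket.gaierror("simulated DNS outage"))

if __name__ == "__main__":
    print("normal:", resolve("localhost"))
    with dns_outage():
        # During the simulated outage, only hosts in the cache stay reachable.
        print("during outage:", resolve("api.internal.example"))
```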

What chaos engineering brings to the fore is the ability to predict these vulnerabilities and provision a contingency plan for these situations. There are some SLAs (100%, five nines) for which it would seem wrong to even contemplate failure, let alone handle it or test for it; the seconds organisations spend thinking about it would already be worth more than the expected loss from failure.

Case Study

In 2016, Starling Bank implemented chaos engineering through an in-house-built tool.

Objective:

To build confidence in the system’s capability to withstand turbulent conditions in production.

Solution:

Chaos Daemon, an internally designed, developed and implemented chaos engineering tool.

Outcome:

Quickly developed the ability to run other experiments that mattered to the bank, like killing all servers in an autoscaling group, which in effect takes out an entire service in Starling’s architecture.

The system predicted failures and built resilience to them.

4. Saves costs by avoiding system outages and safeguards company reputation.

With the increased adoption of microservices and distributed cloud architectures, the web has become more complex than ever, and the complexity will only increase as more distributed systems join the bandwagon. Industries’ dependence on these systems has only grown, yet failures have become much harder to predict.

Any dependency failure can cause a huge economic catastrophe for an organisation. Service unavailability negatively impacts customers, transactions, operational costs and the overall reputation of an organisation. Even short periods of service unavailability deliver a telling blow to an organisation’s bottom line, so the cost of downtime is becoming an increasingly important KPI for many engineering teams. For example, in 2017, 98% of organisations said a single hour of downtime would not only cost their business thousands of dollars but also hurt the organisation’s image for a long time. The CEO of British Airways explained how one failure that stranded tens of thousands of British Airways (BA) passengers in May 2017 cost the company 80 million pounds ($102.19 million USD).
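
As a rough illustration of how the cost-of-downtime KPI can be tracked, the sketch below combines lost revenue, lost productivity and recovery effort into a single figure. All the input numbers are hypothetical placeholders, not figures from the incidents described above.

```python
"""Illustrative cost-of-downtime calculation for the KPI mentioned above.
All inputs are hypothetical; plug in your own revenue and recovery figures.
"""
def downtime_cost(revenue_per_hour: float,
                  productivity_cost_per_hour: float,
                  recovery_cost: float,
                  hours_down: float) -> float:
    """Cost of an outage = lost revenue + lost productivity + recovery effort."""
    return (revenue_per_hour + productivity_cost_per_hour) * hours_down + recovery_cost

if __name__ == "__main__":
    # Hypothetical mid-sized service: $120k/h revenue at risk, $20k/h productivity
    # loss, a $50k one-off recovery effort, and a 3-hour outage.
    print(f"${downtime_cost(120_000, 20_000, 50_000, 3):,.0f}")
```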

Companies need a solution to this challenge—waiting for the next costly outage is not an option.

Cost-benefit analysis points industries towards investing in solutions that help predict possible failure points and devise backup plans for them. Chaos engineering is one such solution: it saves organisations time, improves quality through continuous testing and helps manage costs, not only by avoiding rework and incidents but also by improving the customer experience through being ready with options when issues arise.

Case Study

Background

In August 2017, British Airways lost millions of dollars due to a failure in its check-in processes. Travellers were forced to check in manually, a much slower process, impacting 25,000 passengers.

In May 2017, a human error caused a massive IT outage.

In April 2017, another system outage resulted in seven hours of long delays in online booking, check-in and account access.

Objective

Address British Airways’ inability to get its backup systems up and running in time.

Solution

Identify vulnerable areas and build resilience around them by being able to predict events before they happen.

Outcome:

Through chaos engineering, BA uncovered the underlying impact of its information systems and relational databases, despite the challenges encountered during the implementation process. These information systems and relational databases support British Airways’ business processes to a great degree and helped it maintain its competitive edge in the worldwide airline market.
