Everyone is focused on Resiliency, but what does it really mean?
I have been working on cloud computing since the early days, when AWS had about a dozen services. I have seen the trends shift from experimentation to enterprise adoption, and I have watched enterprise requests mature from “What’s the ROI?” to “How do I build apps in the cloud?”. Now that many enterprises have spent a few years in the cloud, one of the biggest requests I get is “How can we be more resilient?”. I always start by asking, “What does resiliency mean to you?”.
Defining Resiliency
Very often customers confuse reliability with resiliency, so it is important to understand which one they are trying to address. There are a lot of textbook definitions. AWS defines resiliency as “the capability to recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and failure of any component in the workload's components”. That makes sense, but it is highly technical if you are talking to business or non-technical C-suite leaders. NIST defines resiliency as “the ability to prepare for and adapt to changing conditions and withstand and recover rapidly from disruption”, and the Oxford Dictionary has a great definition: “the capacity to withstand or recover quickly from difficulties”.
When talking to teams about resiliency, it is important to get on the same page about what it means to them before going any further. I have a slide I always use that quickly defines reliability and resiliency in plain English, with no technical jargon, in a way that is easy to remember.
Reliability focuses on meeting non-functional requirements such as uptime, service level agreements, performance, quality and the like. Resiliency is how the system responds when those requirements are not being met.
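To make the reliability side concrete, here is a quick back-of-the-envelope sketch in Python, using hypothetical availability targets, that shows how an uptime requirement translates into an allowed downtime budget. That is the kind of number a team needs to agree on before discussing what happens when it is missed:

```python
# Back-of-the-envelope: translate an uptime target into a monthly downtime budget.
# The availability targets below are hypothetical examples, not recommendations.

def downtime_budget_minutes(availability_target: float, days: int = 30) -> float:
    """Minutes of allowed downtime over `days` days for a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability_target)

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} uptime -> {downtime_budget_minutes(target):.1f} minutes of downtime per month")
```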
Why do systems become unreliable?
There are many reasons why a system could be unreliable, and it is not always related to software bugs. Have you ever lost a night of sleep trying to bring a system back up, only to discover that a cert had expired? Not fun. I tried to capture the most common reasons in the following image.
Many teams' first step toward addressing resiliency is to buy observability tools. These tools are very helpful once you know what you want to observe, but as you can see from the image above, they are just one piece of the puzzle. The starting point should be to map out the problem area so that you can see all the dependencies of the system. Each dependency is a potential failure area. By dependencies I don’t mean just those within the architecture, but also those within organizational structures and business processes.
Remember, resiliency is about how quickly we can respond to adversity in a system. A big part of resiliency is MTTR, or mean time to “repair”. I put repair in quotes because a well-architected system can mask failures, meaning a part of the system becomes unreliable but the end users don’t witness any service degradation. Often the time it takes to fix or mask an issue is directly related to processes such as incident management, change management, problem management, and so forth. Too often each of these processes is owned by a silo, and tickets live for a long time as they are passed from group to group. The point here is that observability tools may help you find issues faster, but a seven-step process to resolution is still a seven-step process to resolution.
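As a rough illustration, here is a minimal sketch of how MTTR is typically calculated, using made-up incident timestamps. Every extra handoff or approval in those processes adds directly to each repair time, and therefore to the average:

```python
# Minimal MTTR sketch using hypothetical incident timestamps.
from datetime import datetime, timedelta

# (detected, resolved) pairs for a handful of made-up incidents
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 30)),
    (datetime(2024, 1, 12, 22, 15), datetime(2024, 1, 13, 1, 45)),
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 14, 40)),
]

repair_times = [resolved - detected for detected, resolved in incidents]
mttr = sum(repair_times, timedelta()) / len(repair_times)
print(f"MTTR across {len(incidents)} incidents: {mttr}")
```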
Building resilient systems
To build resilient systems, I recommend focusing on these four areas: Architecture, Operations, Process, and Team Structure. Architecture, because well-architected systems fail less. More importantly, we should proactively build systems that anticipate failure and include mechanisms to mask failures or self-heal. Leveraging chaos experiments is a great way to proactively discover weaknesses in the architecture and fix them before they appear in production.
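To illustrate the idea, here is a minimal chaos-style experiment sketch in Python. The `fetch_profile` service and its cache are hypothetical stand-ins for a real dependency; the point is simply to inject a failure on purpose and verify that the masking path still serves users:

```python
# Minimal chaos-style experiment: deliberately fail a dependency and verify
# that the fallback (failure-masking) path still serves the user.
# fetch_profile and PROFILE_CACHE are hypothetical stand-ins for a real dependency.
import random

PROFILE_CACHE = {"user-123": {"name": "cached-name"}}

def fetch_profile(user_id: str, failure_rate: float = 0.0) -> dict:
    """Call the 'remote' profile service; failure_rate lets us inject faults."""
    if random.random() < failure_rate:
        raise ConnectionError("injected dependency failure")
    return {"name": "fresh-name"}

def get_profile(user_id: str, failure_rate: float = 0.0) -> dict:
    """Resilient wrapper: mask dependency failures with the last known good value."""
    try:
        return fetch_profile(user_id, failure_rate)
    except ConnectionError:
        return PROFILE_CACHE.get(user_id, {"name": "unknown"})

# The experiment: force a 100% failure rate and assert users are still served.
for _ in range(100):
    assert get_profile("user-123", failure_rate=1.0)["name"] == "cached-name"
print("Dependency down, users still served from cache: failure masked.")
```

In a real environment you would run the same idea against non-production infrastructure with a purpose-built tool rather than in-process flags, but the discipline is the same: break it on purpose, observe, and fix.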
Operations. As we embrace the cloud, we are implementing more complex and distributed systems than in the past, and with the adoption of CI/CD pipelines we are deploying more frequently than ever before. If we want to compete, we can no longer afford long review cycles and massive checklists. We must trust the automation we implement in our cloud platforms, pipelines, and monitoring. Operations is becoming too complex for humans to watch everything, which is why we see a push toward concepts like observability and AIOps.
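As one small example of what trusting the automation can look like, here is a sketch of a post-deploy gate: poll a health endpoint for a few minutes and signal a rollback if the service does not stay healthy. The endpoint URL and the rollback step are hypothetical placeholders for whatever your platform provides:

```python
# Sketch of an automated post-deploy gate: poll a health endpoint and signal a
# rollback instead of relying on a manual review checklist.
# HEALTH_URL and the rollback step are hypothetical placeholders.
import time
import urllib.request

HEALTH_URL = "https://example.internal/healthz"  # placeholder endpoint

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def post_deploy_gate(checks: int = 10, interval_seconds: float = 6.0) -> bool:
    """Return True only if the service stays healthy for every check."""
    for _ in range(checks):
        if not is_healthy(HEALTH_URL):
            return False
        time.sleep(interval_seconds)
    return True

if __name__ == "__main__":
    if post_deploy_gate():
        print("Deploy verified: promote the release.")
    else:
        print("Health check failed: trigger the platform's rollback.")
```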
Process. When I visit customers trying to improve resiliency, the culprit for poor recovery time is often process-related. I did a value stream mapping exercise for a customer once and discovered that no matter what the issue was, the best resolution time was five days, even for simple changes. There were so many reviews and approval steps involved that even changing a color on a web page would take a week to implement. Once we made the flow of work visible, people were shocked at the amount of toil in the system. Some of it was because of the number of silos within the organization, and some of it was a reaction to an outage that happened before most of the people were even there. Instead of resolving the issue, the company put another manual step in the process, and over the years it became a rubber-stamp approval. Nobody knew why they did it, but they still did it, and it could add more than a day to the process if the approver was not available. All of these processes impact our ability to recover quickly.
Team Structure. Silos are for grain. The biggest productivity and resiliency killer is silos. I did a DevOps assessment at a high-tech client a few years back. They had the best score I had ever seen from a technical perspective. They were implementing microservices, containers, and fully automated infrastructure and pipelines, but their products' reliability kept declining and their time to recover kept getting worse. Customers were yelling. If they were so good at technology, why were they failing?
The reason was silos. They had a product team, a platform team, a cloud infrastructure team, and an Ops team called SRE (bites tongue). They all had their own backlogs and different goals and objectives. Everybody knew that their issues were caused by the fact that their successful growth had caused the system to outgrow its original architecture, especially in the database area. Long story short, they outgrew what a traditional relational database could handle and needed to supplement it with alternatives. The problem fell on both the infrastructure and platform teams, and the product team had nothing to do with it. But the fix would require changes in the product. The product team’s goals and objectives were too feature-focused, and reliability and resiliency were treated as the responsibility of the SRE team.
It gets worse. For the SRE team to improve their response time, they needed more information in the logs, but they could never get that prioritized by the product team. So they built several home-grown solutions to get more information out of the systems. Over time these solutions became far too complex and grew their own huge backlog of technical debt. The end result was angry customers, because the same problems kept resurfacing.
They implemented observability solutions, chaos engineering, and basically all the modern ops practices, but because they couldn’t influence the product backlog, they continued to work very hard instead of fixing the root cause. This was an organizational structure challenge, not a technical challenge.
Summary
The main point of this post was to articulate that improving resiliency requires more than just buying observability tools or implementing a chaos monkey. It requires systems thinking that makes all work and dependencies visible and systematically improves weaknesses within the system. Sometimes it’s a technical fix, but sometimes it’s much harder because it requires process or organizational change.