Finding the unknown unknowns


Introduction

How do I prevent an outage I can't predict?

If you work as a site reliability engineer (SRE) in the software as a service (SaaS) business, I'm guessing you've heard, thought, or said some version of this at least a few times. Teasing apart that statement could be the subject of an entire book. Good root-cause and post-incident analysis can create insights to identify unknown issues that cause downtime. But you uncover those only after something bad has happened.

Most outages are not simple. They involve the interaction of multiple causes in some order to create sufficient conditions for the outage to occur. In simple outages, sufficiency is obvious and once you've found it, there seems to be a diminishing return on further examination. In more complex outages, sufficiency can lead to quite a few investigative vectors. So how do we sort through all of this? How are we supposed to predict all of the things that can happen to our service? And how do we identify the things we don't know about or aren't thinking about--the unknown unknowns?

What are unknown unknowns?

In a press conference in 2002, Donald Rumsfeld, then United States Secretary of Defense, famously said ...

But there are also unknown unknowns—the ones we don't know we don't know. And ... it is the latter category that tends to be the difficult ones.

I've always been struck by how directly that spoke to my experience. In 2002, that experience was as an application service provider (ASP) rather than a SaaS provider, but at some level, outages are outages, you know? Ever since then I've been thinking about it. How do you identify unknown unknowns?

The known / unknown quadrants

Donald Rumsfeld's un-excerpted comments that day clearly implied a set of known / unknown quadrants, so let's just get that out of the way here.

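[Figure: a two-by-two grid of the known / unknown quadrants: known knowns, known unknowns, unknown knowns, and unknown unknowns, with unknown unknowns in the bottom right.]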

Aside from this paragraph, I am not going to talk about anything except the bottom right--the unknown unknowns--because it is the one that carries unquantifiable risk. In the other three quadrants, the risk related to the kinds of things that might be in the quadrant is small for a variety of reasons. For example, I had a tire go flat recently, and I realized I was not aware that I can recognize a flat tire from the way the steering wheel feels. That is an unknown known, but being unknown did not change the risk or outcome of the event. Similarly, when I am aware that I don't know something, there is usually an implicit risk decision being made. If the unknown is worrisome enough, I am compelled to go figure it out. I don't check my tire pressure before every trip, to continue the metaphor--it's a known unknown. In the absence of any other relevant conditions, it doesn't present a serious risk. If the temperature had just dropped a lot, or I was forced to drive through debris on the road, or had a severe impact with a pothole, I would be inclined to keep a closer eye on it. So the green cells above feel somewhat self-regulating for risk. I believe the same is true in our tech stacks, mostly because there is enough awareness implied in each of those green cells that if something truly risky presents itself, we know and we take action. That is not possible with the unknown unknowns, since there is no awareness to begin with.

The more interesting issues are in the bottom right quadrant: the things we don't know that we don't know. When I had that flat tire fixed, by the way, they told me the tire had dry rot. I had no idea what dry rot was. It was an unknown unknown and I immediately recognized it as such. It made me wonder if I could extrapolate other unknown unknown causes from this one. I thought about a class of problems that might affect tires--environmental issues--and sure enough, some other causes popped up, like having intense sunlight on the tires all the time as in the tropics, or conversely, being in an ultra cold climate. These are pretty simple examples, but you get the point. And then, just by identifying them, they immediately transform into known unknowns and that implicit, mental risk-math starts to happen.

Let me try to use this foundation to inspire expansive thinking about classes of causes and how they can be inspected to look for unknown unknowns. I am actually going to focus on lowering the number of incidents and the MTTR overall, because that is a very practical outcome for such an exercise, and I'm going to use root cause analysis as the process to generate some starter data.

Types of Causes

First, let's step back a moment and think about cause and effect. When you look for a root cause, you usually find a set of antecedent causes that combine in some sequence along some timeline to cause a consequence--an outage. But causes can have different characteristics. I constrain us to three, but understand that this is not a formal proof or statement of logic. I am taking some liberties here with the concepts of necessary, sufficient, and contributory to keep things simple.

Necessary causes

Necessary causes are things that must be true if the outage happened. Unfortunately, there are a lot of necessary causes, so the concept is not super helpful in this context. Even in a simple outage, like a server failure caused by a memory leak, there are a lot of things that are necessary but also irrelevant to the root cause. It is necessary that there be power to the servers that run the application, but we can't go to that level, can we? In fact, we aren't going to talk much about necessary causes unless they are directly related to the outage; we just want to be aware of what they are.

Sufficient causes

I prefer to think in terms of sufficient causes--the set of events that, when combined, will produce the outage. That is, if these things are all true, the outage must happen. Occasionally that is just a single thing, but that is very unusual. It is more often a combination of at least a few different things. A concise, well-crafted root cause narrative should enumerate all of the causes required for sufficiency. Here is a simple example we can use and refer back to:

  1. The workload contains an expensive database query
  2. There is a sudden, sustained, major increase in demand for that workload
  3. The database tier becomes resource bound and slows down
  4. The database fails due to resource exhaustion

The top three items are the sufficient causes for this hypothetical outage. If all three of them happen, the system will fail; it's just a matter of timing. However, a good root cause analysis doesn't stop at the point of the failure. It needs to contain all of the events up to the moment when actions turn to recovery. At that point, the on-call team has figured out a plan to restore service and is working through it, increasing the likelihood of system recovery as they go. All of the events up until that time are relevant to understanding the possible causes for the outage and the associated repair time, and they are certainly all worthy of inspection as possible risks under different circumstances.
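To make the idea of sufficiency concrete, here is a minimal sketch that treats the three sufficient causes from the example as a set of conditions. The identifiers (`expensive_query`, `demand_spike`, `db_resource_bound`) and the function name are made up for illustration, and real causes are rarely clean booleans, so read this as a toy model rather than a method.

```python
# Toy model: the outage happens only when every sufficient cause is present.
# The cause names are hypothetical labels for the three items in the example.
SUFFICIENT_CAUSES = frozenset({
    "expensive_query",     # the workload contains an expensive database query
    "demand_spike",        # sudden, sustained, major increase in demand
    "db_resource_bound",   # the database tier becomes resource bound
})


def outage_occurs(observed: set) -> bool:
    """Return True when all sufficient causes are among the observed conditions."""
    return SUFFICIENT_CAUSES <= observed


all_present = set(SUFFICIENT_CAUSES)
print(outage_occurs(all_present))  # True: the full set is present

# Removing any single cause breaks sufficiency in this model.
for cause in sorted(SUFFICIENT_CAUSES):
    print(cause, outage_occurs(all_present - {cause}))  # False each time
```

The loop is the useful part: each sufficient cause is a place where removing one condition would have prevented this particular outage.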

Contributing causes

I define contributing causes as those that affect behavior (and therefore the timing) of the outage or recovery but are not sufficient causes. If you take away a contributing cause, the outage still happens. As an example, if you've been involved in responding to outages, you've probably heard someone say...

Yes, that issue hurt us, but it's not what caused the outage...

or

We would have made it through that if it weren't for...

Bingo. That's language that indicates possible contributing causes. In our example above, perhaps there was a problem fetching a credential to start the rebuild operation. That is an important factor in the time to repair, but it is not a sufficient or even necessary cause for the outage.

And so, as I define them, contributing causes don't even have to be necessary causes. The fact that the outage happened doesn't mean these contributing causes must have happened. And yet they are important for two reasons:

  1. They represent things that could cause future outages in other contexts where they are unknown unknowns. This gives you a tactic to lower your rate of incidence, along with fixing all the sufficient causes.
  2. Fixing them will usually shorten the repair time for similar outages in the future. This gives you a tactic to lower your MTTR.

That combination--attacking both the rate of incidence and the MTTR--is extremely powerful because the product of those is downtime.
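As a rough worked example of that product, the sketch below multiplies incidents per year by MTTR. All of the figures are hypothetical and chosen only to make the arithmetic easy to follow; the function name is mine.

```python
def expected_downtime_minutes(incidents_per_year: float, mttr_minutes: float) -> float:
    """Downtime as the product of rate of incidence and mean time to repair."""
    return incidents_per_year * mttr_minutes


# Hypothetical baseline: 12 incidents a year, 60 minutes to repair each.
baseline = expected_downtime_minutes(12, 60)         # 720 minutes
fewer_incidents = expected_downtime_minutes(9, 60)   # 540 minutes
faster_repair = expected_downtime_minutes(12, 45)    # 540 minutes
both = expected_downtime_minutes(9, 45)              # 405 minutes

print(baseline, fewer_incidents, faster_repair, both)
```

In this made-up case, cutting incidents by a quarter and cutting MTTR by a quarter each save the same 180 minutes, and doing both saves 315. Which lever is cheaper to pull depends on the service.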

Analyzing causes to find unknown unknowns

When I outlined sufficient causes above, I explained that this was the set of causes that combine to cause an outage. When we start to apply this in practice, there is a tendency to focus on those to the exclusion of other things. I think this is human nature. Once we have internalized the causal events and timeline that caused the outage--all the sufficient causes and maybe a few others--our nature is to pivot towards short- and long-term fixes for all the discovered issues. Those are concrete problems we can solve, and engineers are really good at solving concrete problems. It also probably meets the requirements of your existing post-incident analysis process, whether that is ITIL, SOC 2, or whatever you are using, and so that feels satisfying as well.

But when you are trying to find unknown unknowns, understanding sufficiency is ironically insufficient for a complete analysis. It's only one level deep. Let me go ahead and add another three levels to it.

  1. Do I understand the set of causes that were sufficient to cause the outage? This is a primary outcome of a good root cause analysis.
  2. Have I examined other contributing or necessary causes, particularly those that aren't required for sufficiency?
  3. Have I looked at ways to shorten recovery time for this outage independently of the causes that have already been considered? Are there new insights into how to restore service more quickly if a similar outage happens again?
  4. Are there behaviorally driven causes that affect the occurrence of the outage or the recovery time that were not previously identified? Have I examined all of the potential human factors issues? (Bear in mind that it's possible there is a human factor issue as a sufficient cause. In that case, you look for others.)

Thought exercise - causes are instances of classes

Think of each cause or issue uncovered in the analysis above as an instance of a class of related causes with associated attributes. What other instances of these classes might there be, and do they pose a risk? This is essentially what I was suggesting for #3 in the section above, but here I want to take it a bit further. Let's take two causes from the example above and make them instances of classes. The two causes are the increase in workload and the inefficient query. The former is an instance of the class burst events, the latter an instance of the class inefficient code. The first instances listed for each class (costly workflow and expensive SQL queries) are those from our hypothetical outage. The additional instances after those are examples of the kind of thinking I am talking about.

It is really just a failure analysis using the causes from our thought exercise as input. In the example below, I'll just point out a few things. These enumerated causes are going to look familiar, because they happen all the time in a lot of contexts. There is zero magic in what I'm talking about relative to that. What I think is powerful is the idea that you take an extrapolation and cross-reference it against a diverse set of services in your brain to see if something scares you. There will even be cases where you recall causes from prior outages, like the Jira ticket referenced below, and start associating them with a class of outages.

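[Figure: class diagram with two classes, burst events and inefficient code, each listing the instance from the hypothetical outage followed by additional example instances, including a Jira ticket from a prior outage.]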

Of course, I'm not expecting folks to go off and start drawing class diagrams for root cause analysis documentation. The point is to think about it that way and let the natural power of your brain do the rest of the work. In the course of thinking through other instances of these classes, do you stumble across something that presents a previously unappreciated or unknown risk? These instances are possible candidates for the elusive unknown unknowns, particularly when viewed from the perspective of another system.
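For readers who think better in code than in diagrams, the class-and-instance framing can be sketched roughly as follows. The two class names and the two observed instances come from the example; the extrapolated instances and the service names are hypothetical ones I picked for illustration and are not taken from the figure.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class CauseClass:
    name: str
    observed: str                                      # instance seen in the analyzed outage
    extrapolated: list = field(default_factory=list)   # other instances to consider


def review_prompts(classes, services):
    """Cross-reference every extrapolated instance against every service."""
    prompts = []
    for cls, service in product(classes, services):
        for instance in cls.extrapolated:
            prompts.append(f"[{cls.name}] Could '{instance}' affect {service}?")
    return prompts


classes = [
    CauseClass("burst events", "costly workflow",
               ["retry storm after a dependency recovers",      # hypothetical
                "batch jobs all starting at the same time"]),   # hypothetical
    CauseClass("inefficient code", "expensive SQL queries",
               ["N+1 query pattern",                            # hypothetical
                "unbounded in-memory cache"]),                  # hypothetical
]
services = ["billing-api", "search"]  # hypothetical service names

for prompt in review_prompts(classes, services):
    print(prompt)
```

The output is just a list of questions to ask yourself. Any question that makes you uneasy about a particular service is a candidate unknown unknown that has just become a known unknown.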

Final thought - focus on downtime

I opened this article by asking, "How am I supposed to prevent an outage that I can’t predict?" But I need to revise that now because it is not quite right. The goal I want to focus on is to reduce downtime. There are two basic ways to do that.

  1. Reduce the number of outages that occur, i.e., lower the rate of incidence. The table stakes for that is to fix all the sufficient causes for the outages you have had. Building on that, we expand our thinking into classes of related failures to identify other possible high risk causes--causes not sufficient or even implicated in a prior outage.
  2. Fix outages as fast as possible when they occur, i.e., lower the MTTR. Here is where everything else comes into play, particularly the contributing causes. Those and any other items discovered during the thought exercises above need to be evaluated on a cost/benefit basis relative to impact on MTTR.

Pragmatism is important here. Of course we may well have a requirement to fix all the sufficient causes ASAP. But please bear in mind there might be a bigger and better outcome attacking MTTR depending on the nature of the issues. Downtime is downtime, and sometimes the quickest, most effective way to lower it is to pivot to MTTR issues.
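One rough way to make that cost/benefit comparison concrete is to rank candidate fixes by estimated downtime saved per unit of effort. The sketch below is an assumption-heavy illustration: the `Fix` type, the estimates, and the capacity item are hypothetical, and only the query and credential items echo the example outage.

```python
from dataclasses import dataclass


@dataclass
class Fix:
    name: str
    downtime_saved_minutes_per_year: float  # hypothetical estimate
    cost_engineer_days: float               # hypothetical estimate

    @property
    def benefit_per_day(self) -> float:
        """Estimated minutes of downtime saved per engineer-day spent."""
        return self.downtime_saved_minutes_per_year / self.cost_engineer_days


candidates = [
    Fix("rewrite the expensive database query", 180, 10),            # sufficient cause
    Fix("fix credential fetch for the rebuild operation", 120, 2),   # contributing cause
    Fix("add capacity headroom to the database tier", 100, 5),       # hypothetical item
]

ranked = sorted(candidates, key=lambda f: f.benefit_per_day, reverse=True)
for fix in ranked:
    print(f"{fix.benefit_per_day:5.1f}  {fix.name}")
```

With these made-up numbers the MTTR-oriented credential fix ranks first, even though it was never a sufficient cause of the outage. Real estimates will be fuzzier, but the ordering exercise is the same.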


Preventing outages is a no-brainer, but these days it's often more advisable to engineer something that recovers more quickly than to try to make it fail less often. We should have both approaches at our disposal, because they both decrease downtime.

By the way, do you know where there is another great source of content for unknown unknown problems? In your backlog of operational debt, but I'll leave that for another time.

How do you identify unknown unknowns?

Brad L.

Solution & Partner Architecture, Senior Director | Hybrid & Cloud | Security | Alliances | I help companies innovate and grow 3x faster with pragmatic solutions and alliances.

2 years ago

I love the tie into unknown unknowns as I use it a bit in the security space I am in right now. We abstract out to a “known pattern of life” at the network layer that an AI model can chew on. What we have seen is that it helps with those longer tail issues. Think of a small, minimal behavioral pattern change starting on an instance, then spreading to a fleet gradually over time. In my space that may be an indicator of lateral movement. Great stuff as always, Dave!

Denise Rocha

20+ years nursing/clinical informatics and healthcare IT implementation experience, BSN RN and Epic Principal Trainer for Inpatient Orders at Tufts Medicine; Clinical Instructor GLTHS LPN Program; Home Care RN

2 years ago

Analysis of precursors to identify predictive data prior to previous outages?

Wolfgang Ihloff

Enabling Generative Artificial Intelligence for the Enterprise - Product Leader @ Aleph Alpha | Getting things done

2 years ago

For me this sparked the idea of watching non-events closely and starting to identify root causes from them. We likely see several non-events before an outage, so what proactive steps can we take there to prevent real outages? Adrian Cockcroft gave a talk on this a few years back: https://www.youtube.com/watch?v=C9VchTAd7AM
