The Availability Enigma
David Owczarek
Senior leader, writer, and speaker focused on SRE/DevOps and operating compliant, secure services at scale in the cloud
What’s availability?
One of the slipperiest terms in site reliability engineering (SRE) is availability. It is meant to be a crucial indicator of website uptime, even though it is a notoriously shallow metric. But how do you calculate it? If you follow the Google doctrine, there are two approaches: time-based and aggregate availability. Time-based availability is the one most of us are familiar with. It is the ratio of uptime to total time in a given period. Aggregate availability is instead based on the ratio of a specific metric, such as successful requests to total requests. These two approaches can produce very different outcomes, so let's take an example and get into the weeds.
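In rough terms, the two formulas look like this:

availability (time-based) = uptime / (uptime + downtime)
availability (aggregate) = successful requests / total requests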
Example calculations
Consider the following hypothetical outage. The outage lasts for an hour, during which half of all transactions fail. Everything else is normal. There is data available for the error rate and response time, so these will be our service level indicators (SLIs). We also have service level objectives (SLOs) and an SLA.
I'm trying to keep this simple so we can focus on the differences in methods of calculation. So let's further clarify that the incident starts and stops instantaneously. The kind of transaction isn't relevant, but think of it as a simple, one-shot service like a link shortener or file conversion.
Here's what the error rate and response time data looks like for the period around our outage. We will use these to help determine how to calculate availability.
Time-based availability
Using this method, we assess that the error rate objective was breached for a full 60 minutes. Even though some transactions succeeded during this time, the system itself was in a state of fault for the entire hour. This is a conservative view of downtime, but it captures the idea that as long as the system is not operating normally, it is incurring downtime. There is no relativism here: anything that is out of SLO is downtime.
So it follows that we will use one hour of downtime for the availability calculation. Assuming we are doing monthly reporting (for a month with 30 days, or 720 hours), that means:
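availability = (720 - 1) / 720 = 719 / 720 ≈ 99.86%

That lands just below three nines.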
Aggregate availability
An aggregate availability calculation can solve one of the issues noted above: the error rate was breached for a full hour, but only half of the transactions were affected. In fact, many customers would not even have been aware that something was wrong. By taking the full hour as downtime, we are opening ourselves up to SLA claims from parties that didn't experience availability problems, and that seems wrong. We need more numbers for this calculation, so let's say that our application always gets 100 requests per hour, and also assume that the error rate when it is healthy is zero. There are 720 hours in a 30-day month, so that's 72,000 total requests. During the outage hour, 50% of the requests, or 50 requests, failed. The aggregate calculation is:
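availability = (72,000 - 50) / 72,000 = 71,950 / 72,000 ≈ 99.93%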
It turns out that this is the same result you get by multiplying the outage duration by the error rate and using that pro-rated downtime in the time-based availability calculation above.
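Spelled out:

0.5 error rate × 1 hour = 0.5 hours of effective downtime
availability = (720 - 0.5) / 720 ≈ 99.93%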
Other complicating circumstances for your enjoyment
Now, here is another twist. What if you discovered that even though 50% of the transactions failed, only 5% of all customers were affected? Those customers had a pretty bad experience for that hour, but all the rest were fine.
If you perform an aggregate availability calculation using customers impacted versus total customers, you get an even higher availability number. I'm including this to be provocative, because while it is not a common SLO, it does illustrate the shallowness of these metrics. You could spend a lot of time hunting for metrics that make you look better under certain circumstances. That is definitely NOT the point of what we are doing. We are trying to pick metrics that accurately reflect the customer experience. But I digress. The availability in this case is:
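availability ≈ 1 - (0.05 × 1 / 720) ≈ 99.993%

(This pro-rates the one hour of downtime by the 5% of customers affected. There are other ways you could weight a customer-based metric, so treat this as one reasonable reading rather than the definitive one.)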
Conclusion
As you can see, the way you interpret the metrics and associated SLOs can have a large impact on your availability results. Here, we went from below three nines (< 99.9%) to above four nines (> 99.99%) just by shifting approaches. Each of those numbers tells a different story, but it is a shallow narrative. If you use the time-based approach, it exaggerates the customer impact. If you use error rate, it doesn’t say how those errors are distributed. And so on.
There are some reasonable motivations for using these different methods, though. Using time-based availability gives us a fault-based view into the service. It produces, essentially, a set of outage durations, which you can then use to calculate mean time to restore and mean time between failures (MTTR and MTBF), if you are wont to do that.
Time-based calculations have a major drawback: they don't reflect the actual customer experience. The aggregate approach addresses this. In the first aggregate availability example, the downtime (60 minutes) is effectively pro-rated by the impact, 50%, which means the impact calculation is based on the area under the error rate curve. That makes a lot of sense and is the best overall tradeoff for this example.
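One way to sketch that relationship, assuming a steady request rate as in our example:

effective downtime = area under the error rate curve (error rate × duration, summed over the reporting period)
availability = 1 - (effective downtime / total time)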
The third calculation shows how you can take this too far. You can find metrics to support all kinds of numbers. But do they represent the typical user experience? It is also important to be consistent across services so that availability means the same thing everywhere, at least for major classes of products or technology. Then, hopefully, everyone can agree that up is up and down is down.