The Availability Enigma

What’s availability?

One of the slipperiest terms in site reliability engineering (SRE) is availability. It is intended to be a crucial indicator of website uptime, even though it is a notoriously shallow metric. But how do you calculate it? If you follow the Google doctrine, there are two approaches: time-based and aggregate availability. Time-based is the one most of us are familiar with. It's the ratio of uptime to total time in a given period. Aggregate availability is based on the ratio of a specific metric, such as the ratio of successful requests to total requests. These two approaches can produce very different outcomes, so let's take an example and get into the weeds.
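The two definitions are easy to express directly. Here is a minimal sketch (the function names are my own, not from any standard tooling):

```python
def time_based_availability(downtime_hours: float, total_hours: float) -> float:
    """Time-based: the ratio of uptime to total time in the period."""
    return (total_hours - downtime_hours) / total_hours

def aggregate_availability(failed_requests: int, total_requests: int) -> float:
    """Aggregate: the ratio of successful requests to total requests."""
    return (total_requests - failed_requests) / total_requests
```

Both return a fraction between 0 and 1; multiply by 100 to get the familiar "nines" percentage.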

Example calculations

Consider the following hypothetical outage. The outage lasts for an hour, during which half of all transactions fail. Everything else is normal. There is data available for the error rate and response time, so these will be our service level indicators (SLIs). We also have service level objectives (SLOs) and an SLA.

  • Response time: SLO < 350 ms
  • Error rate: SLO < 2%
  • Availability: SLA > 99.9%

I'm trying to keep this simple so we can focus on the differences in methods of calculation. So let's further clarify that the incident starts and stops instantaneously. The kind of transaction isn't relevant, but think of it as a simple, one-shot service like a link shortener or file conversion.

Here's what the error rate and response time data looks like for the period around our outage. We will use these to help determine how to calculate availability.

[Figure: error rate and response time around the outage. The error rate spikes to 50% for the hour of the incident; response time remains within its SLO.]

Time-based availability

Using this method, we assess that the error rate objective was breached for a full 60 minutes. Even though some transactions succeeded during this time, the system itself was in a state of fault for the entire hour. This is a conservative view of downtime, but it represents the idea that as long as the system is not operating normally, it is incurring downtime. There is no relativism here: anything that is out of SLO is downtime.

So it follows that we will use one hour of downtime for the availability calculation. Assuming we are doing monthly reporting (for a month with 30 days, or 720 hours), that means:

Availability = (720 - 1) / 720 = 719 / 720 ≈ 99.861%
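The same calculation in code (a sketch using the numbers from this example):

```python
# 30-day month: 720 hours total; the whole outage counts as downtime.
total_hours = 30 * 24         # 720
downtime_hours = 1.0

availability = (total_hours - downtime_hours) / total_hours
print(f"{availability:.3%}")  # 99.861% - below the 99.9% SLA
```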

Aggregate availability

An aggregate availability calculation can solve one of the issues noted above: the error rate was breached for a full hour, but only half of the transactions were affected. In fact, many customers would not even be aware that something was wrong. By taking the full hour as downtime, we are opening ourselves up to SLA claims from parties that didn't experience availability problems, which seems wrong. We need more numbers for this calculation, so let's say that our application always gets 100 requests per hour, and assume that the error rate when it is healthy is zero. There are 720 hours in a 30-day month, so that's 72,000 total requests. During the outage, 50% of the 100 requests in that hour, or 50 requests, failed. The aggregate calculation is:

Availability = (72,000 - 50) / 72,000 = 71,950 / 72,000 ≈ 99.931%

It turns out that this is the same as the time-based calculation with the downtime pro-rated by the error rate: multiplying the 50% error rate by the one-hour outage gives 0.5 hours of effective downtime, and (720 - 0.5) / 720 ≈ 99.931% as well.
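A quick check of both the aggregate calculation and the pro-rating equivalence (a sketch with this example's numbers):

```python
requests_per_hour = 100
total_hours = 720
total_requests = requests_per_hour * total_hours   # 72,000
failed_requests = 50                               # half of one hour's traffic

aggregate = (total_requests - failed_requests) / total_requests
print(f"{aggregate:.3%}")                          # 99.931%

# Equivalent: pro-rate the one-hour outage by the 50% error rate
# and plug the result into the time-based formula.
effective_downtime = 0.5 * 1.0                     # hours
time_based = (total_hours - effective_downtime) / total_hours
assert abs(aggregate - time_based) < 1e-12
```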

Other complicating circumstances for your enjoyment

Now, here is another twist. What if you discovered that even though 50% of the transactions failed, only 5% of all customers were affected? Those customers had a pretty bad experience for that hour, but all the rest of the customers were fine.


If you perform an aggregate availability calculation using customers impacted to total customers, you get an even higher availability number. I'm including this to be provocative, because while it is not a common SLO, it does illustrate the shallowness of these metrics. You could spend a lot of time looking around for metrics that make you look better under certain hardships. That is definitely NOT the point of what we are doing. We are trying to pick metrics that accurately reflect the customer experience. But I digress. The availability in this case is:

Availability = 1 - (0.05 × 1 / 720) ≈ 99.993%
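In code, the customer-based version pro-rates the outage hour by the 5% of customers affected (again a sketch, assuming the affected customers saw the problem for the full hour):

```python
affected_customer_fraction = 0.05   # 5% of customers
outage_hours = 1.0
total_hours = 720

availability = 1 - affected_customer_fraction * (outage_hours / total_hours)
print(f"{availability:.4%}")        # 99.9931% - above four nines
```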

Conclusion

As you can see, the way you interpret the metrics and associated SLOs can have a large impact on your availability results. Here, we went from below three nines (< 99.9%) to above four nines (> 99.99%) just by shifting approaches. Each of those numbers tells a different story, but it is a shallow narrative. If you use the time-based approach, it exaggerates the customer impact. If you use error rate, it doesn’t say how those errors are distributed. And so on.

There are some reasonable motivations for using these different methods, though. Time-based availability gives us a fault-based view into the service. It produces a set of outage durations, essentially, which you can then use to calculate mean time to restore and mean time between failures (MTTR and MTBF), if you are wont to do that.

Time-based calculations have a major drawback: they don't reflect the actual customer experience. In the first aggregate availability example, the downtime (60 minutes) is effectively pro-rated by the impact, 50%. This means the impact calculation is based on the area under the error rate curve. That makes a lot of sense and is the best overall tradeoff for this example.

The third calculation shows how you can take this too far. You can find metrics to support all kinds of numbers. But do they represent the typical user experience? Likewise, it is important to be consistent across services so that availability means the same thing, at least for major classes of products or technology. Then, hopefully everyone can agree that up is up and down is down.

