The Availability Enigma
David Owczarek
Senior leader, writer, and speaker focused on SRE/DevOps and operating compliant, secure services at scale in the cloud
What’s availability?
One of the slipperiest terms in site reliability engineering (SRE) is availability. It is meant to be a crucial indicator of website uptime, even though it is a notoriously shallow metric. But how do you calculate it? If you follow the Google doctrine, there are two approaches: time-based and aggregate availability. Time-based availability is the one most of us are familiar with. It is the ratio of uptime to total time in a given period. Aggregate availability is instead based on the ratio of a specific metric, such as successful requests to total requests. These two approaches can produce very different outcomes, so let's take an example and get into the weeds.
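In rough terms, the two formulas look like this:

availability (time-based) = uptime / (uptime + downtime)
availability (aggregate) = successful requests / total requests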
Example calculations
Consider the following hypothetical outage. The outage lasts for an hour, during which half of all transactions fail. Everything else is normal. There is data available for the error rate and response time, so these will be our service level indicators (SLIs). We also have service level objectives (SLOs) and an SLA.
I'm trying to keep this simple so we can focus on the differences in methods of calculation. So let's further clarify that the incident starts and stops instantaneously. The kind of transaction isn't relevant, but think of it as a simple, one-shot service like a link shortener or file conversion.
Here's what the error rate and response time data looks like for the period around our outage. We will use these to help determine how to calculate availability.
Time-based availability
Using this method, we assess that the error rate objective was breached for a full 60 minutes. Even though some transactions succeeded during this time, the system itself was in a state of fault for the entire hour. This is a conservative view of downtime, but it captures the idea that as long as the system is not operating normally, it is incurring downtime. There is no relativism here: anything that is out of SLO is downtime.
So it follows that we will use one hour of downtime for the availability calculation. Assuming we are doing monthly reporting (for a month with 30 days, or 720 hours), that means:
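availability = (720 - 1) / 720 = 719 / 720 ≈ 99.86%

That lands just below three nines.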
Aggregate availability
An aggregate availability calculation can solve one of the issues noted above: the error rate was breached for a full hour, but only half of the transactions were affected. In fact, many customers would not even have been aware that something was wrong. By taking the full hour as downtime, we are opening ourselves up to SLA claims from parties that didn't experience availability problems, and that seems wrong. We need more numbers for this calculation, so let's say that our application always gets 100 requests per hour, and also assume that the error rate when it is healthy is zero. There are 720 hours in a 30-day month, so that's 72,000 total requests. During the outage hour, 50% of the requests, or 50 requests, failed. The aggregate calculation is:
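availability = (72,000 - 50) / 72,000 = 71,950 / 72,000 ≈ 99.93%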
It turns out that this is the same result you get by multiplying the outage duration by the error rate and using that pro-rated downtime in the time-based availability calculation above.
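Spelled out:

0.5 error rate × 1 hour = 0.5 hours of effective downtime
availability = (720 - 0.5) / 720 ≈ 99.93%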
Other complicating circumstances for your enjoyment
Now, here is another twist. What if you discovered that even though 50% of the transactions failed, only 5% of all customers were affected? Those customers had a pretty bad experience for that hour, but all the rest were fine.
If you perform an aggregate availability calculation using customers impacted versus total customers, you get an even higher availability number. I'm including this to be provocative, because while it is not a common SLO, it does illustrate the shallowness of these metrics. You could spend a lot of time hunting for metrics that make you look better under certain circumstances. That is definitely NOT the point of what we are doing. We are trying to pick metrics that accurately reflect the customer experience. But I digress. The availability in this case is:
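availability ≈ 1 - (0.05 × 1 / 720) ≈ 99.993%

(This pro-rates the one hour of downtime by the 5% of customers affected. There are other ways you could weight a customer-based metric, so treat this as one reasonable reading rather than the definitive one.)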
Conclusion
As you can see, the way you interpret the metrics and associated SLOs can have a large impact on your availability results. Here, we went from below three nines (< 99.9%) to above four nines (> 99.99%) just by shifting approaches. Each of those numbers tells a different story, but it is a shallow narrative. If you use the time-based approach, it exaggerates the customer impact. If you use error rate, it doesn’t say how those errors are distributed. And so on.
There are some reasonable motivations for using these different methods, though. Using time-based availability gives us a fault-based view into the service. It produces, essentially, a set of outage durations, which you can then use to calculate mean time to restore and mean time between failures (MTTR and MTBF), if you are wont to do that.
Time-based calculations have a major drawback: they don't reflect the actual customer experience. The aggregate approach addresses this. In the first aggregate availability example, the downtime (60 minutes) is effectively pro-rated by the impact, 50%, which means the impact calculation is based on the area under the error rate curve. That makes a lot of sense and is the best overall tradeoff for this example.
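One way to sketch that relationship, assuming a steady request rate as in our example:

effective downtime = area under the error rate curve (error rate × duration, summed over the reporting period)
availability = 1 - (effective downtime / total time)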
The third calculation shows how you can take this too far. You can find metrics to support all kinds of numbers. But do they represent the typical user experience? It is also important to be consistent across services so that availability means the same thing everywhere, at least for major classes of products or technology. Then, hopefully, everyone can agree that up is up and down is down.