Uptime Percentages, Recovery Time Objective and Error Budgets
Animesh Mukherjee
Experienced in large-scale hybrid IT operations with emphasis on cloud, cost, cyber and ITSM.
It is very common to talk about the number of ‘nines’ an application is expected to deliver, a shorthand for requiring it to be available 99.99% of the time, or some similar figure. This is usually called the ‘uptime’ and is expressed as the percentage of a period, say a month, during which the application has to be up. However, it is important to take a deeper look and understand what it really means for the design, architecture, and operation of an application.
There are two other terms that are important in this context. Recovery Time Objective, aka RTO, is the maximum time the business can run with the application down. This is a business specification and depends on how the application is used in the business process. If this is an e-commerce application that is used on a 24/7 basis by its end-users, the RTO has to be very short, on the order of seconds or minutes. Otherwise, the business loses money on potential customers who are unable to place orders. But there are many applications where downtime may be an inconvenience or a nuisance while the business process is unaffected, because of the time of day or week or the availability of alternatives. For example, a financial reporting application may be critical for a few days after month end while the books are being closed, but not critical once that work is completed. Since the RTO determines the architecture we choose and its redundancy, designing for an RTO lower than what the business really needs adds unnecessary cost.
Error budgets tell us how many minutes of downtime we have left during the measurement period before we break the specified uptime percentage. For example, if an application needs an uptime of 99.5% each month, it can be down for 216 minutes before this is breached (30 days * 24 hours * 60 minutes * 0.5%; we ignore the differing durations of months). This is the error budget for this application. Every time it is down, we subtract the outage from this budget, and as we run out we need to take steps to avoid exceeding it. For example, when only a few minutes are left this month, we may postpone all changes to the application and its infrastructure until the next month to avoid breaching the threshold.
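The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring tool: the function names and the sample outage durations are hypothetical, and every month is assumed to have 30 days, as in the text.

```python
from typing import Iterable


def error_budget_minutes(uptime_pct: float, days: int = 30) -> float:
    """Allowed downtime, in minutes, for a period of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_pct) / 100.0


def remaining_budget(uptime_pct: float,
                     outages_minutes: Iterable[float],
                     days: int = 30) -> float:
    """Budget left after subtracting every recorded outage (in minutes)."""
    return error_budget_minutes(uptime_pct, days) - sum(outages_minutes)


# Hypothetical month: three outages of 45, 120 and 30 minutes.
budget = error_budget_minutes(99.5)
left = remaining_budget(99.5, [45, 120, 30])
print(f"budget: {budget:.0f} min, remaining: {left:.0f} min")
# prints "budget: 216 min, remaining: 21 min"
```

With only 21 minutes left, this would be the point at which a team might freeze changes until the next month.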
The error budget also tells us how quickly we have to fix any issue and bring the application back up if something happens. There is a minimum time needed to detect the failure, even when automated tools are used, followed by the time for the response. It is unlikely that any triage even starts before 10-15 minutes have passed, and even the quickest troubleshooting will take 20 minutes or more. A single incident handled by people therefore consumes at least half an hour, so any error budget smaller than that cannot be met by manual response alone and has to be met by the design itself.
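To make this concrete, the sketch below counts how many manually handled incidents fit into a monthly error budget. The 15-minute and 20-minute figures are taken from the ranges mentioned above and are assumptions, not measurements; the function names are hypothetical, and a 30-day month is assumed.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 30-day month, an assumption


def monthly_budget(uptime_pct: float) -> float:
    """Allowed downtime per month, in minutes."""
    return MINUTES_PER_MONTH * (100.0 - uptime_pct) / 100.0


def manual_incidents_affordable(uptime_pct: float,
                                detect_and_triage_min: float = 15.0,
                                fix_min: float = 20.0) -> int:
    """Whole number of manually handled incidents that fit in the budget."""
    per_incident = detect_and_triage_min + fix_min
    return int(monthly_budget(uptime_pct) // per_incident)


for target in (99.5, 99.9, 99.99, 99.999):
    n = manual_incidents_affordable(target)
    print(f"{target}% uptime: {monthly_budget(target):.2f} min budget, "
          f"{n} manual incident(s) per month")
```

Under these assumptions a 99.5% target leaves room for six such incidents a month and 99.9% for one, while 99.99% and above leave room for none, which is why those targets need automated failover rather than human response.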
An application with an uptime of 99.999% (five nines) has an error budget of about 0.4 minutes, roughly 26 seconds, per month. Staying within so little downtime every month is only possible if the application runs in at least two separate locations (availability zones in cloud parlance) and uses a global load balancer to automatically send traffic to the working instance. This is very expensive both to design and to operate, and can only be justified if the cost of the downtime avoided exceeds the additional cost.
This is the most important takeaway from this discussion: ask the business what the cost of downtime is per minute. For example, an e-commerce application that sells $50 million per year, with 50% of sales concentrated in the final 2 months of the year, will lose ~$57 per minute during the 10 low-season months [$25,000,000 / ((365 - 61) * 24 * 60)]. A 200-minute error budget will cost about $11,400 in sales if it is fully used up, and this has to be compared to the cost of the redundant design. However, in the busy season the cost is ~$285 per minute in lost sales [$25,000,000 / (61 * 24 * 60)], and 200 minutes will cost about $57,000. In this case it is better to bring up the full redundancy only during the busy season and freeze all changes during that time!
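The same calculation can be written as a short Python sketch, which makes it easy to plug in a different revenue figure or season length. The sales numbers come from the example above; the monthly redundancy cost of $20,000 is purely hypothetical and is there only to show the comparison.

```python
def revenue_per_minute(season_revenue: float, season_days: int) -> float:
    """Average sales lost per minute of downtime within a season."""
    return season_revenue / (season_days * 24 * 60)


def redundancy_pays_off(loss_per_minute: float,
                        budget_minutes: float,
                        redundancy_cost: float) -> bool:
    """True if using up the whole budget costs more than the redundancy."""
    return loss_per_minute * budget_minutes > redundancy_cost


ANNUAL_SALES = 50_000_000      # from the example
BUSY_SHARE = 0.5               # half of sales fall in the busy season
BUSY_DAYS = 61                 # final two months of the year
BUDGET_MINUTES = 200           # monthly error budget from the example
REDUNDANCY_COST = 20_000       # hypothetical monthly cost of a second site

low = revenue_per_minute(ANNUAL_SALES * (1 - BUSY_SHARE), 365 - BUSY_DAYS)
busy = revenue_per_minute(ANNUAL_SALES * BUSY_SHARE, BUSY_DAYS)

for name, rate in (("low season", low), ("busy season", busy)):
    exposure = rate * BUDGET_MINUTES
    worth_it = redundancy_pays_off(rate, BUDGET_MINUTES, REDUNDANCY_COST)
    print(f"{name}: ${rate:,.2f}/min, ${exposure:,.0f} per {BUDGET_MINUTES} min, "
          f"redundancy justified: {worth_it}")
```

With the hypothetical redundancy cost used here, the second site is justified only in the busy season, which matches the recommendation to run it seasonally.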
To summarize, you need to understand the business cost of downtime and then compare it with the cost of redundancy to choose the right error budget and implementation that meets the business needs. For cloud-based applications this can mean bringing up the redundant sites only during the busy season. When operating the application, track every minute of downtime and make sure you stay within the error budget. Finally, investigate every outage and permanently fix the root cause so it never occurs again.
At Tailwinds we are helping teams design, build, deploy and operate cloud-native applications securely with lower cost and faster time to market using our Internal Developer Platform (IDP) product - MajorDomo.
CEO/Founder Tailwinds.ai
2 years ago
Very good article, Animesh Mukherjee. Totally agree that the SLO needs to be business driven and not a technology problem. One can spend a lot of time and effort solving a technology problem that does not have business drivers.