Uptime Percentages, Recovery Time Objective and Error Budgets

Animesh Mukherjee

Experienced in large-scale hybrid IT operations with emphasis on cloud, cost, cyber and ITSM.

Published: January 9, 2023

It is very common to talk about the number of 'nines' that an application is expected to deliver, shorthand for a requirement such as 99.99% availability. This is usually called the 'uptime' and is expressed as the percentage of a period, say a month, during which the application has to be up. However, it is important to take a deeper look and understand what that number really means for the design, architecture, and operation of an application.

There are two other terms that are important in this context. Recovery Time Objective, aka RTO, is the maximum time the business can run with the application down. This is a business specification and depends on how the application is used in the business process. If it is an e-commerce application that end-users rely on 24/7, the RTO has to be very short, on the order of seconds or minutes. Otherwise, the business loses money on potential customers who are unable to place orders. But there are many applications where an outage is inconvenient or a nuisance, yet the business process is unaffected because of the time of day or week, or because alternatives are available. For example, a financial reporting application may be critical for a few days after month end while the books are being closed, but not critical once that work is completed. Since the RTO determines the architecture we choose and its redundancy, designing for an RTO lower than what the business really needs adds unnecessary cost.

Error budgets tell us how many minutes of downtime we have left in the measurement period before we break the specified uptime percentage. For example, if an application needs an uptime of 99.5% each month, it can be down for just over 200 minutes before the target is breached (30 days * 24 hours * 60 minutes * 0.5% = 216 minutes, ignoring the differing lengths of months). This is the error budget for this application. Every time the application is down, we subtract the outage from this budget, and as the budget runs out we need to take steps to avoid exceeding it. For example, when only a few minutes are left this month, we may postpone all changes to the application and its infrastructure until next month to avoid breaching the threshold.
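
The arithmetic above can be sketched in a few lines of Python. This is only an illustration under the same 30-day-month simplification; the function names, the sample outage durations, and the 30-minute reserve used to trigger a change freeze are assumptions made for the example, not values from any particular tool.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(uptime_pct: float, days: int = 30) -> float:
    """Allowed downtime, in minutes, for a period of `days` days."""
    return days * MINUTES_PER_DAY * (100.0 - uptime_pct) / 100.0


def remaining_budget(uptime_pct: float, outages_minutes: list[float], days: int = 30) -> float:
    """Budget left after subtracting every outage recorded so far in the period."""
    return error_budget_minutes(uptime_pct, days) - sum(outages_minutes)


def should_freeze_changes(
    uptime_pct: float,
    outages_minutes: list[float],
    reserve_minutes: float = 30.0,  # assumed safety margin, pick your own
    days: int = 30,
) -> bool:
    """True when the remaining budget is at or below the chosen reserve."""
    return remaining_budget(uptime_pct, outages_minutes, days) <= reserve_minutes


# Hypothetical outages (in minutes) recorded so far this month.
outages = [45, 120, 30]
print(f"Monthly budget at 99.5%: {error_budget_minutes(99.5):.0f} minutes")
print(f"Remaining budget: {remaining_budget(99.5, outages):.0f} minutes")
print(f"Freeze changes: {should_freeze_changes(99.5, outages)}")
```

With these made-up outages, most of the monthly budget is already spent, which is the situation where postponing changes until next month makes sense.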

The error budget also tells us how quickly we have to fix an issue and bring the application back up when something happens. There is a minimum time needed to detect the failure, even when automated tools are used, followed by the time for the response. It is unlikely that any triage even starts before 10-15 minutes have passed, and even the quickest troubleshooting will take 20 minutes or more. These minimums inform the design choices we have: a budget smaller than the time people need to respond cannot be met by manual recovery.

An application with an uptime of 99.999% (five nines) has an error budget of about 0.4 minutes per month. Staying within so little downtime every month is only possible if the application runs in two separate locations (availability zones in cloud parlance) and uses a global load balancer to automatically send traffic to the working instance. This is very expensive both to design and to operate, and can only be justified if the cost of the downtime it avoids exceeds the additional cost of the redundancy.
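
To see where manual recovery stops being realistic, the sketch below lists the monthly error budget for several common targets and compares each one with a minimum manual response time. The 30-minute floor is an assumption built from the low end of the estimates above (about 10 minutes before triage starts plus about 20 minutes of troubleshooting); the names and the list of targets are illustrative.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 30-day month, as in the example above

# Assumed floor for a human-driven recovery: ~10 min to start triage
# plus ~20 min of troubleshooting (low end of the estimates in the text).
MIN_MANUAL_RESPONSE_MINUTES = 10 + 20


def error_budget_minutes(uptime_pct: float) -> float:
    """Allowed downtime per 30-day month, in minutes."""
    return MINUTES_PER_MONTH * (100.0 - uptime_pct) / 100.0


def manual_recovery_fits(uptime_pct: float) -> bool:
    """True if even a single manually handled outage fits in the monthly budget."""
    return error_budget_minutes(uptime_pct) >= MIN_MANUAL_RESPONSE_MINUTES


for target in (99.0, 99.5, 99.9, 99.99, 99.999):
    budget = error_budget_minutes(target)
    verdict = "manual recovery possible" if manual_recovery_fits(target) else "needs automated failover"
    print(f"{target:>7}% -> {budget:8.2f} min/month ({verdict})")
```

Under these assumptions, a single manually handled outage already exceeds the budget at four nines and beyond, which is why such targets push the design toward automatic failover between locations.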

This is the most important takeaway from this discussion: ask the business what the cost of downtime is per minute. For example, an e-commerce application that sells $50 million per year, with 50% of sales concentrated in the final 2 months of the year, will lose about $57 per minute during the 10 low-season months [$25,000,000 / ((365-61)*24*60)]. A 200-minute error budget will cost roughly $11,400 in sales if it is used up, and that figure has to be compared to the cost of the redundant design. However, in the busy season the cost is about $285 per minute in lost sales [$25,000,000 / (61*24*60)], and 200 minutes will cost roughly $57,000. In this case it is better to bring up the full redundancy only during the busy season and freeze all changes during that time!
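
The same comparison can be written as a short Python sketch. It recomputes the per-minute figures from the stated assumptions ($50 million per year, half of it in the final 61 days) and lands close to the rounded numbers quoted above. The redundancy cost of $30,000 per season is a made-up placeholder purely for illustration, and all function and variable names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def revenue_per_minute(season_revenue: float, season_days: int) -> float:
    """Average sales per minute over a season, assuming round-the-clock use."""
    return season_revenue / (season_days * MINUTES_PER_DAY)


def downtime_cost(rate_per_minute: float, downtime_minutes: float) -> float:
    """Sales lost if the application is down for `downtime_minutes`."""
    return rate_per_minute * downtime_minutes


def redundancy_pays_off(rate_per_minute: float, downtime_minutes: float, redundancy_cost: float) -> bool:
    """True when the lost sales would exceed what the redundant design costs."""
    return downtime_cost(rate_per_minute, downtime_minutes) > redundancy_cost


ANNUAL_SALES = 50_000_000
BUSY_SHARE = 0.5              # half of sales fall in the busy season
BUSY_DAYS = 61                # final two months of the year
LOW_DAYS = 365 - BUSY_DAYS
ERROR_BUDGET_MINUTES = 200    # rounded budget used in the text
REDUNDANCY_COST = 30_000      # hypothetical cost per season, for illustration only

low_rate = revenue_per_minute(ANNUAL_SALES * (1 - BUSY_SHARE), LOW_DAYS)
busy_rate = revenue_per_minute(ANNUAL_SALES * BUSY_SHARE, BUSY_DAYS)

for name, rate in (("low season", low_rate), ("busy season", busy_rate)):
    loss = downtime_cost(rate, ERROR_BUDGET_MINUTES)
    worth_it = redundancy_pays_off(rate, ERROR_BUDGET_MINUTES, REDUNDANCY_COST)
    print(f"{name}: ${rate:,.0f}/min, ${loss:,.0f} per {ERROR_BUDGET_MINUTES} min, redundancy pays off: {worth_it}")
```

With the placeholder redundancy cost, the redundant design only pays for itself in the busy season, which matches the conclusion of running full redundancy just for those two months.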

To summarize, you need to understand the business cost of downtime and then compare it with the cost of redundancy to choose the right error budget and implementation that meets the business needs. For cloud-based applications this can mean bringing up the redundant sites only during the busy season. When operating the application, track every minute of downtime and make sure you stay within the error budget. Finally, investigate every outage and permanently fix the root cause so it never occurs again.

At Tailwinds we are helping teams design, build, deploy and operate cloud-native applications securely with lower cost and faster time to market using our Internal Developer Platform (IDP) product - MajorDomo.

#SLO #errorbudget #cloudnative #sre #platformengineering #internaldeveloperplatform #itbm #itsm #itom

Bhargav Bhikkaji

CEO/Founder Tailwinds.ai

2 years ago

Very good article Animesh Mukherjee. Totally agree on the SLO needs to be business driven and not a technology problem. One can spend a lot of time and effort to solve a technology problem that does not have business drivers
