Service Level Agreement Part 3

 

Part 1 and Part 2 of this series covered the basics of probability and service level agreements. Now it is time to get serious about calculating SLAs for bigger systems and look at some more real-life examples. 

Azure SLAs 

As I said before, I’m an Azure guy. I will use Azure in the rest of this article, but the same is true for Google Cloud and AWS; I just don’t know the terminology to talk about those and sound smart. 

Below is a list of a few published SLAs from Microsoft for some Azure services.  

[Image: a table of published Azure SLAs, including AKS at 99.95%, the AKS node virtual machines at 99.99%, and Azure Key Vault at 99.9%.]

When you think about what an SLA really is, it represents the probability that when you check on the status of the system, it will be online. For instance, with AKS, you can see above that the SLA is 99.95%. That means the probability that AKS is online at any given time is 0.9995 (just take the percentage and turn it into a decimal). As we described in Part 1 of this series, the probability that the opposite is true (AKS is down) is 1 - 0.9995 = 0.0005. 
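To make the conversion concrete, here is a minimal Python sketch (the 99.95% figure is the AKS SLA from the table above):

```python
# An SLA percentage is just a probability in disguise.
aks_sla = 0.9995        # AKS's 99.95% SLA, expressed as a probability of being up
p_down = 1 - aks_sla    # probability AKS is DOWN at any given check

print(aks_sla)           # 0.9995
print(round(p_down, 6))  # 0.0005
```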

Regional Services vs. Non-Regional Services 

In Azure, you deploy your solutions to whatever datacenter or “region” is most beneficial to you. There are several regions available, as documented at Choose the Right Azure Region for You | Microsoft Azure. You may choose a region based upon the services you wish to use, where your offices are located, or compliance rules on where your data can be stored.  

Whatever the case, most of the services you use will be “regional”. This means that you will place each one into a specific region, and it will just run there. In case of an outage of the entire region, your service will be down. To achieve higher availability than that, you will need to deploy your solution to at least two regions and supply some sort of load balancing or traffic routing. Azure already pairs regions for you for some scenarios, so you should follow those same pairings; they are documented in Microsoft’s region pairs documentation.

Some Math 

When you combine services within a region, you calculate the SLA of your entire solution by simply multiplying the SLAs together. Why is that? Well, this works because you can reasonably assume that many of the services fail independently of one another, and remember: 

P(A and B) = P(A) * P(B). 

Using the chart from before, if we have a solution that includes AKS, the AKS nodes, and an Azure Key Vault, we can say the SLA for the whole solution is: .9995 * .9999 * .999 = 0.9984 => 99.84%. What does that work out to in terms of uptime? From https://uptime.is/99.84, we see that leaves over 2 mins of interruption per day.   
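The multiplication above can be checked with a few lines of Python (service SLAs as in the example, and assuming independence as discussed):

```python
# Composite SLA for independent services in one region:
# multiply the individual up-probabilities, then translate the
# downtime fraction into minutes per day.
slas = [0.9995, 0.9999, 0.999]   # AKS, AKS node VMs, Key Vault

composite = 1.0
for s in slas:
    composite *= s

downtime_minutes_per_day = (1 - composite) * 24 * 60

print(f"{composite:.4f}")                 # 0.9984
print(f"{downtime_minutes_per_day:.1f}")  # 2.3 (minutes per day)
```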

I don’t know about you, but that doesn’t sound so good! How can we improve this? All three of the services mentioned are regional, so maybe we can deploy all three into multiple regions and provide some sort of traffic routing in front to keep things running.  

Let’s assume we deploy all 3 services identically into two paired regions. Let’s also assume that we have a traffic management system (ahem, there’s a service called Traffic Manager in Azure for exactly this) that routes users as necessary to one or the other region. When a region is down, all the traffic goes to the region that works; when both regions are up, the traffic is load balanced between them. 

We can now think of that 99.84% SLA as being the SLA for just a single region. If we have two regions, A and B, then the probability of both regions being up is  

P(A and B) = P(A) * P(B) = .9984 * .9984 = 0.9968 => 99.68%.  

BUT WAIT!!  That’s worse than just having one region! Ah, that’s because we do not need both regions to be up. We are not concerned with P(A and B)! We are only concerned about the probability that both are DOWN, which would mean that our app is down. That probability is P(not A and not B) = .0016 * .0016 = 0.00000256.  If the probability of both being down is .00000256, then the probability that at least one is up is: 

1 - .00000256 = 0.99999744 => 99.9997% SLA. 

Wow! Now that’s an SLA! Looking at https://uptime.is/99.9997, we see that’s under two minutes of downtime PER YEAR!!  Now that’s some magnificent work! 
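A quick Python sketch of the two-region arithmetic, using the 99.84% single-region figure from above:

```python
# Two identical regions: the app is down only when BOTH are down.
p_region_up = 0.9984
p_region_down = 1 - p_region_up          # 0.0016

p_both_down = p_region_down ** 2         # 0.00000256
p_at_least_one_up = 1 - p_both_down

downtime_minutes_per_year = p_both_down * 365 * 24 * 60

print(f"{p_at_least_one_up:.8f}")          # 0.99999744
print(f"{downtime_minutes_per_year:.2f}")  # 1.35 (minutes per year)
```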

But Wait...What about Non-Regional Services?

Here’s where things get trickier. There are Azure services that are non-regional, meaning that you cannot create higher availability by deploying them multiple times. One such service is Azure Active Directory (AAD), which is a linchpin in many solutions, as it provides authentication services. AAD has a published SLA of only 99.9%.  

If we consider the same solution as above and add in AAD, we see that instead of multiplying it in with each region, we must multiply it in once, as a separate part of the final calculation. Instead of just P(A and B), we have a “C” that we must include, so: 

P(A and B and C) = .9984 * .9984 * .999 = .9958.  

No matter what we’ve done with our regional availability, AAD hurts us. 

Is that Really Right? 

If AAD is down, does our solution work at all? The answer is probably no! What does that mean for our probability?  

Well, now we must talk about dependent events. We have, thus far, treated the regional services independently. But really, if AAD doesn’t work, nothing works! Therefore, the regions are DEPENDENT upon AAD!  

If we think about a Venn Diagram to show this, consider the following: 

[Image: a Venn diagram of three overlapping events A, B, and C.]

If AAD is the C in the diagram, then what we need to find out is the probabilities of “C and A”, “C and B”, and “C and A and B”.  That would be  

P(C and A) = .999 * .9984 = 0.9974016 

P(C and B) = .999 * .9984 = 0.9974016 

P(C and A and B) = .999 * .9984 * .9984 = 0.99580575744 

What we come out with is that the probability of being either all up or having one region up with AAD is: 

P(C and A) + P(C and B) - P(C and A and B) = 0.99899744 => roughly a 99.9% SLA. 

Why the subtraction? That little area in the middle (where A, B, and C all overlap) is counted once in P(C and A) and again in P(C and B), so we must take one copy away or it gets counted twice.  
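The inclusion-exclusion step works out like this in a short Python sketch (numbers from the example):

```python
# Solution is up when AAD (C) is up AND at least one region (A or B) is up.
# Inclusion-exclusion: P(C and (A or B)) = P(C and A) + P(C and B) - P(C and A and B)
p_a = p_b = 0.9984   # each region's composite SLA
p_c = 0.999          # AAD

p_solution_up = (p_c * p_a) + (p_c * p_b) - (p_c * p_a * p_b)
print(f"{p_solution_up:.6f}")  # 0.998997

# Sanity check: since C is independent of the regions, this equals
# P(C) * P(at least one region up).
alt = p_c * (1 - (1 - p_a) * (1 - p_b))
```

The sanity check at the end is worth noting: because AAD fails independently of the regions, the Venn-diagram sum collapses to simply multiplying AAD’s SLA by the two-region availability we computed earlier.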

Your Support Team 

Another thing to consider is what you will tell your support teams. We achieved a great level of availability for our example, but that doesn’t mean that the support staff will only have to work for a minute a year! The probability they are concerned with is not the same. They are only concerned with maximizing the time that EVERYTHING is working. We found that to be: 

P(A and B and C) = 0.9958.   

Using https://uptime.is/99.58, we see that that provides a little over 3 hours of work a month. You’d be wise to order your support folk some cookies. 
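For the support team’s number, a small Python sketch (assuming a 30-day month for the conversion, which lands close to the uptime.is figure):

```python
# The support team cares about P(everything up): both regions AND AAD.
p_all_up = 0.9984 * 0.9984 * 0.999      # ~0.9958

minutes_per_month = 30 * 24 * 60        # 43,200 minutes in a 30-day month
work_minutes = (1 - p_all_up) * minutes_per_month

print(f"{work_minutes / 60:.1f} hours")  # 3.0 hours (of degraded time per month)
```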

Conclusion 

Calculating SLAs for complex solutions is not a straightforward thing. Even in our example, we still did not consider the chance that our load balancer is down or downtime for any of the foundational infrastructure (storage accounts, virtual networking, firewalls, etc.) that are necessary to run our system.  

The deeper you take this exercise, including more and more components in your calculation, the clearer it becomes that your solution is never going to achieve 100% uptime. This is exactly why Microsoft and Google recommend building your system with that in mind! Make sure that your development team knows this, too, and does things like configuring retries, setting timeouts, and handling errors appropriately.  

All told, this is not to say that the cloud sucks. In fact, if you really start digging into on-premises systems and their historical service levels, I argue that you will see similar numbers. In the cloud, people just admit it. 

 
