Service Level Agreements Part 2

Chris S.

Published: March 10, 2021

Hopefully, folks are feeling "refreshed" after viewing Part 1 of this series. So now let's talk about Service Level Agreements (SLAs).

What is an SLA?

The SLA is essentially a promise from a service provider about how well their service is going to work. Let's talk about some examples.

How long before you will get a call back from a support ticket? You've probably run into this one when you've called the cable company or your doctor's billing department and had to leave a message. While not all companies publish their SLAs, most are at least tracking how long these return calls take so they can hit some target, such as "The SLA for returned calls is within 30 minutes".

How long will it take to get the permissions you've requested added to a file? This is like the support ticket, but maybe your permission request must flow through 2 or 3 approvers. Where I work, most permissions requests seem to have an SLA of 5 days, though I believe I've seen a few as low as 3 days.

SLAs for Cloud Services

Back in the "old days", folks kept full control of their datacenters. They would have to make sure the server racks were appropriately wired for power, check on the backup power and possibly a generator, wire everything up, and watch all the servers/networking gear hum along. By having this control, infrastructure teams could flex some muscle and do things like reboot servers in a careful manner, keeping their applications online in the meantime. They could patch servers in the middle of the night or during the least busy times, which often meant working late at night or on weekends.

Then folks started outsourcing their datacenter work and moving to offsite datacenters. The work still existed, but their people didn't have to do it all. This didn't take away the hand-holding that teams could do with their servers, like rebooting clustered servers carefully, but it did put a stop to a lot of the late nights. Sometimes you could even send your own folks to the datacenters to do things like replace servers. Pretty handy!

The cloud is not so different! In the cloud, there are still datacenters, power systems, server hardware, and networking gear. The difference is that you don't get to go there at all. Microsoft, Amazon, and Google certainly will not allow a server engineer from Mom-and-Pop Bubble Gum Shop to walk in and run something on a server.

So, what do you use to make sure you're running? You use the Service Level Agreements for the cloud services you consume, and design everything KNOWING THAT YOU WILL NOT BE UP 100% OF THE TIME. No more "cheating" and hand-holding your clusters through reboots. No more building a single big database server to run everything and just babysitting the server to make sure it runs, thus saving your core licensing costs. You MUST design for failure in the cloud.

A Change in Thinking

You must change how you think about uptime. While your provider may beat its SLA, it absolutely does not have to, and for many providers the SLA is the only target they aim to meet. Microsoft documents this in the first paragraph of their Principles of the Reliability Pillar document:

"Building a reliable application in the cloud is different from traditional application development. While historically you may have purchased levels of redundant higher-end hardware to minimize the chance of an entire application platform failing, in the cloud, we acknowledge up front that failures will happen."

Similarly, Google's Site Reliability Engineering book (https://sre.google/sre-book/table-of-contents/) makes it very clear that SREs do not have a goal of keeping things running 100% of the time. They only have a goal to meet the SLA, and if they do not have as much downtime as the SLA allows, then they are not making enough changes. (Chapter 3 - Embracing Risk)

A Cool Site

One of my favorite sites is https://uptime.is/. On this site, you input the SLA and it will calculate how much downtime you can expect on a daily, weekly, monthly, and yearly basis. This is a brilliant place to play with the numbers to see for yourself just how bad something like "99% uptime" really is (more than 3.5 days of downtime a year!).
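If you'd rather see the arithmetic than take a calculator's word for it, here is a minimal Python sketch of the same idea. This is my own illustration, not how uptime.is works internally: the function name `allowed_downtime` is made up, and the period lengths (a 30-day month and a 365-day year) are assumptions, so other calculators may land on slightly different numbers.

```python
from datetime import timedelta

# Assumed period lengths (a 30-day month and a 365-day year).
# Other calculators may use slightly different conventions.
PERIODS = {
    "day": timedelta(days=1),
    "week": timedelta(days=7),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def allowed_downtime(sla_percent: float) -> dict[str, timedelta]:
    """Return the downtime an SLA permits in each period (hypothetical helper)."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    # The fraction of time the service is allowed to be down.
    down_fraction = (100 - sla_percent) / 100
    return {name: length * down_fraction for name, length in PERIODS.items()}


# Print the allowed downtime for a 99% SLA in each period.
for period, downtime in allowed_downtime(99).items():
    print(f"99% SLA allows {downtime} of downtime per {period}")
```

With a 99% SLA, 1% of a 365-day year is 3.65 days, which is where the "more than 3.5 days" figure comes from.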

Azure

I'm going to focus on Azure only, as that's what I really know. Google (GCP) and AWS both have similar setups, so just take this info and use their own SLAs.

You can find the SLAs for all the cloud services offered by Azure at the following site: https://azure.microsoft.com/en-us/support/legal/sla/. Each service offered has a different SLA associated with it, and there are some caveats.

For instance, with Virtual Machines (VMs), you will see that Microsoft provides an incredibly low SLA on the least-expensive single-instance VMs (95%). You can expect more than 1 hour of downtime a day from such an SLA. So if you're going to run a simple webpage, like maybe your personal blog, you can expect the VM it's running on to only be accessible 95% of the time. By moving to an Availability Set, though, with at least two VMs, you can expect 99.95% availability. Worth it? I would say yes.
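To put the two VM options side by side, here is a rough Python sketch. The 95% and 99.95% figures are the ones quoted above, so check the Azure SLA page for current values. The helper name `downtime_hours` and the 365-day year are my own assumptions for illustration.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365  # assumes a 365-day year


def downtime_hours(sla_percent: float, period_hours: float) -> float:
    """Hours of downtime an SLA permits over a period (hypothetical helper)."""
    return period_hours * (100 - sla_percent) / 100


# SLA percentages as quoted in this article, not fetched from Azure.
options = [("single-instance VM", 95.0), ("Availability Set", 99.95)]

for label, sla in options:
    per_day_minutes = downtime_hours(sla, HOURS_PER_DAY) * 60
    per_year_hours = downtime_hours(sla, HOURS_PER_YEAR)
    print(f"{label} ({sla}%): {per_day_minutes:.1f} min/day, {per_year_hours:.1f} h/year")
```

The allowed downtime drops from 5% of the time to 0.05%, a factor of 100: roughly 72 minutes a day shrinks to under a minute a day.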

Conclusion of this Part

I hope you've learned something here, or at least confirmed what you knew and gotten some links to bookmark! Next time, I'll talk about SLAs when you start combining services to create a single architecture.
