Service Level Agreements Part 2
Hopefully, folks are feeling "refreshed" after viewing Part 1 of this series. So now let's talk about Service Level Agreements (SLA).
What is an SLA?
The SLA is essentially a promise from a service provider about how well their service is going to work. Let's talk about some examples.
How long before you get a call back on a support ticket? You've probably run into this one when you've called the cable company or your doctor's billing department and had to leave a message. Not all companies publish their SLAs, but most at least track how long these return calls take so they can hit some target, such as "The SLA for returned calls is within 30 minutes".
How long will it take to get the file permissions you've requested? This is like the support ticket, but maybe your permission request must flow through 2 or 3 approvers. Where I work, most permission requests seem to have an SLA of 5 days, though I believe I've seen a few as low as 3 days.
SLAs for Cloud Services
Back in the "old days", folks kept full control of their datacenters. They would have to make sure the server racks were appropriately wired for power, check on the backup power and possibly a generator, wire everything up, and watch all the servers/networking gear hum along. With this control, infrastructure teams could flex some muscle and do things like reboot servers in a careful manner, keeping their applications online in the meantime. They could patch servers in the middle of the night or during the least busy times, which often meant working late at night or on weekends.
Then folks started outsourcing their datacenter work and moving to offsite datacenters. The work still existed, but their people didn't have to do it all. This didn't take away all of the hand-holding that teams could do with their servers, like rebooting clustered servers carefully, but it did put a stop to a lot of the late nights. Sometimes you could even send your own folks to the datacenters to do things like replace servers. Pretty handy!
The cloud is not so different! In the cloud, there are still datacenters, power systems, server hardware, and networking gear. The difference is that you don't get to go there at all. Microsoft, Amazon, and Google certainly will not allow a server engineer from Mom-and-Pop Bubble Gum Shop to walk in and run something on a server.
So, what do you use to make sure you're running? You use the Service Level Agreements for the cloud services you consume, and design everything KNOWING THAT YOU WILL NOT BE UP 100% OF THE TIME. No more "cheating" and hand-holding your clusters through reboots. No more building a single big database server to run everything and just babysitting the server to make sure it runs, thus saving your core licensing costs. You MUST design for failure in the cloud.
A Change in Thinking
You must change how you think about uptime. While your provider may beat its SLA, it absolutely does not have to, and for many providers the SLA is the only target they have to meet. Microsoft documents this in the first paragraph of their Principles of the Reliability Pillar document:
"Building a reliable application in the cloud is different from traditional application development. While historically you may have purchased levels of redundant higher-end hardware to minimize the chance of an entire application platform failing, in the cloud, we acknowledge up front that failures will happen."
Similarly, Google's Site Reliability Engineering book (https://sre.google/sre-book/table-of-contents/) makes it very clear that SREs do not have a goal of keeping things running 100% of the time. Their goal is to meet the agreed availability target, and if they are not using up the downtime that target allows, then they are not making enough changes. (Chapter 3 - Embracing Risk, see the "Managing Risk" section)
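To make that idea concrete, here is a minimal sketch of the arithmetic. The SRE book calls the allowed downtime an "error budget". The function name, the 30-day period, and the sample numbers are my own assumptions for illustration, not anything taken from the book or from a real provider.

```python
def error_budget(sla_percent, period_minutes, downtime_minutes):
    """Return (allowed, remaining) downtime in minutes for one period.

    Hypothetical helper: all names and inputs are illustrative assumptions.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    # The SLA allows (100 - sla_percent)% of the period to be downtime
    allowed = period_minutes * (100 - sla_percent) / 100
    # A negative remainder means the target was missed for this period
    remaining = allowed - downtime_minutes
    return allowed, remaining


# Assumed example: a 99.9% target over a 30-day month with 10 minutes of outage
MINUTES_IN_30_DAYS = 30 * 24 * 60
allowed, remaining = error_budget(99.9, MINUTES_IN_30_DAYS, 10)
print(f"Allowed: {allowed:.1f} min, remaining: {remaining:.1f} min")
```

If the remaining number stays large month after month, that is the signal the book is talking about: there is room to ship more changes.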
A Cool Site
One of my favorite sites is https://uptime.is/. On this site, you input the SLA and it calculates how much downtime you can expect on a daily, weekly, monthly, and yearly basis. This is a brilliant place to play with the numbers and see for yourself just how bad something like "99% uptime" really is (more than 3.5 days of downtime a year!).
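If you'd rather run the numbers yourself, the calculation is simple enough to sketch in a few lines. This is my own rough version, not how uptime.is is implemented, and the period lengths (a 30-day month and a 365-day year) are assumptions, so results may differ slightly from the site.

```python
from datetime import timedelta

# Assumed period lengths for this sketch: 30-day month, 365-day year
PERIODS = {
    "day": timedelta(days=1),
    "week": timedelta(days=7),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def allowed_downtime(sla_percent):
    """Return the downtime an SLA permits in each period, as timedeltas."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    down_fraction = (100 - sla_percent) / 100
    return {name: length * down_fraction for name, length in PERIODS.items()}


for name, downtime in allowed_downtime(99.0).items():
    print(f"99% uptime allows {downtime} of downtime per {name}")
```

Try plugging in 99.9, 99.95, and 99.99 to see how quickly each extra "nine" shrinks the allowed downtime.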
Azure
I'm going to focus on Azure only, as that's what I really know. Google (GCP) and AWS both have similar setups, so just take this info and use their own SLAs.
You can find the SLAs for all the cloud services offered by Azure at the following site: https://azure.microsoft.com/en-us/support/legal/sla/. Each service offered has a different SLA associated with it, and there are some caveats.
For instance, with Virtual Machines (VMs), you will see that Microsoft provides an incredibly low SLA on the least-expensive single-instance VMs (95%). An SLA like that allows for more than an hour of downtime a day. So if you're going to run a simple webpage, like maybe your personal blog, you can only count on the VM it's running on being accessible 95% of the time. By moving to an Availability Set with at least two VMs, though, the SLA goes up to 99.95% availability. Worth it? I would say yes.
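Here is a quick sketch comparing those two numbers side by side. The 95% and 99.95% figures are the ones quoted above; the function and variable names are just mine for illustration.

```python
HOURS_PER_DAY = 24


def downtime_hours_per_day(sla_percent):
    """Hours of downtime per day that an SLA percentage permits."""
    return HOURS_PER_DAY * (100 - sla_percent) / 100


# SLA figures as quoted in this post
single_vm = downtime_hours_per_day(95.0)
availability_set = downtime_hours_per_day(99.95)

print(f"Single VM (95%): {single_vm:.2f} hours/day")
print(f"Availability Set (99.95%): {availability_set * 60:.2f} minutes/day")
print(f"Reduction in allowed downtime: about {single_vm / availability_set:.0f}x")
```

That works out to roughly 1.2 hours a day versus well under a minute a day, which is why I think the second VM is worth paying for.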
Conclusion of this Part
I hope you've learned something here, or at least confirmed what you knew and gotten some links to bookmark! Next time, I'll talk about SLAs when you start combining services to create a single architecture.