Workload Availability on GCP: When 99.99% is less than customers expect

The cloud runs around the clock, with highly automated maintenance and hardware that is more dependable than ever. So, can application owners and CIOs assume that everything runs 100% perfectly, e.g., on the Google Cloud Platform? The quick answer is “no” – and delving into GCP’s Service Level Objectives (SLOs) for sample core workload services – Compute Engine, Cloud Function, Cloud Run, and App Engine – is an eye-opener for managers. GCP’s SLOs are brutally clear, but they do not match naive customer expectations.

GCP’s SLOs for Compute Engine

Compute Engine is Google Cloud Platform’s (GCP) name for Virtual Machines (VMs). GCP guarantees an impressive monthly availability for single VMs of 99.9%, with some VM configurations even achieving 99.95% (memory-optimized, premium tier). By deploying premium tier VMs in multiple zones, the availability can reach 99.99%. These numbers translate to a maximum monthly downtime of roughly 43 minutes (99.9%), 22 minutes (99.95%), or 4.4 minutes (99.99%).
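For illustration, here is a minimal Python sketch of how those downtime budgets follow from the availability percentages. The function name and the 30-day-month assumption are mine, not GCP’s:

```python
# Translate a monthly availability SLO into a downtime budget,
# assuming a 30-day month (43,200 minutes).
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def max_monthly_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime (in minutes) allowed by a monthly availability SLO."""
    return MINUTES_PER_MONTH * (1 - availability_percent / 100)

for slo in (99.9, 99.95, 99.99):
    print(f"{slo}% -> {max_monthly_downtime_minutes(slo):.1f} min/month")
```

Running this prints roughly 43.2, 21.6, and 4.3 minutes per month for the three SLO tiers.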

Google’s SLO fine print, however, is not impressive but shocking. An issue must persist for more than one minute to count against the SLO – but a one-minute outage already equals roughly 0.002% of monthly downtime, which adds up quickly and makes the value of a 99.99% SLO questionable. Furthermore, downtime is defined as losing external connectivity or access to the persistent disk(s). What does that mean for crashed or terminated VMs?

First, GCP does not charge for them (yippie!). Second, to my best understanding, they are irrelevant for the SLO (hmmm). Third, no VM means no application execution. So, crashed VMs can be an issue for IT managers who must commit to uptime guarantees at the application level toward their business stakeholders. Thus, when looking at GCP’s VM SLOs, IT departments must architect for high availability at the application level to handle shorter outages, even when running in the cloud.
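The effect of the one-minute threshold mentioned above can be sketched in a few lines of Python. The numbers and function names are illustrative assumptions of mine; the point is that outages just under the threshold never show up in the measured downtime:

```python
# Sketch of the "one-minute threshold" effect: outages shorter than the
# threshold do not count toward the measured SLO downtime, even though
# users experience them. Numbers and names are illustrative.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def measured_downtime_minutes(outages_seconds, threshold_s=60):
    """Only outages lasting longer than threshold_s count toward the SLO."""
    return sum(s for s in outages_seconds if s > threshold_s) / 60

outages = [59] * 10                    # ten just-under-a-minute outages
experienced = sum(outages) / 60        # what users actually suffered
measured = measured_downtime_minutes(outages)

print(f"experienced: {experienced:.1f} min "
      f"({experienced / MINUTES_PER_MONTH:.4%} of the month)")
print(f"counted toward the SLO: {measured:.1f} min")
```

Ten 59-second outages add up to almost ten minutes of real unavailability, yet zero minutes count against the SLO.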

GCP’s SLOs for Cloud Function, Cloud Run, and App Engine

Let’s move from IaaS to PaaS workloads. GCP Cloud Function eases the execution of short code snippets. It is especially beneficial for “gluing” services together, e.g., when adding a new object to a Cloud Storage bucket should trigger some processing such as image recognition or scanning a text document for hate speech. Cloud Run enables engineers to create functions executing containerized application code; a (web) service invocation then triggers their execution.

Both GCP services, Cloud Function and Cloud Run, come with a 99.95% availability guarantee. But, again, this GCP SLO comes with surprising side constraints. The definition of Cloud Function downtime is:

  • At least 10% of the requests must fail (only GCP-induced issues count).
  • This situation must last for at least one minute.
  • The measurement period must contain at least 100 service requests.

The definition of Cloud Run downtime is the following:

  • Error rate of at least 1% (HTTP 5xx status responses)
  • This situation must last for at least one minute.
  • The measurement period must contain at least 100 service requests.

So, the GCP 99.95% SLO does not imply that Google guarantees that 99.95% of the service requests succeed! If (close to) 1% of Cloud Run invocations – or 10% of Cloud Function invocations – fail in a month, this might still be within the GCP 99.95% SLO, depending on the temporal distribution of the failed requests.
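A small simulation makes this concrete. It is a hedged sketch, not GCP’s actual measurement method: I model the Cloud Run-style rule from above as “a minute counts as downtime if it has at least 100 requests and an error rate of at least 1%,” and I assume a constant load of 1,000 requests per minute. All names and numbers are mine:

```python
# Why the same overall error volume can land inside or outside a 99.95% SLO,
# depending on how the failures cluster in time. Illustrative model only.
MINUTES = 30 * 24 * 60   # 43,200 minutes in a 30-day month
REQS_PER_MIN = 1000      # assumed constant load

def downtime_minutes(errors_per_minute):
    """Count minutes with >= 100 requests and an error rate >= 1%."""
    return sum(
        1 for errors in errors_per_minute
        if REQS_PER_MIN >= 100                  # "at least 100 requests" rule
        and errors / REQS_PER_MIN >= 0.01       # "error rate >= 1%" rule
    )

def within_slo(down_minutes, slo=99.95):
    return down_minutes <= MINUTES * (1 - slo / 100)   # budget: ~21.6 min

# Scenario A: ~0.9% of ALL requests fail, spread evenly (9 errors/minute).
# Every minute stays below the 1% threshold -> zero measured downtime.
spread = [9] * MINUTES

# Scenario B: roughly the same total error count, but concentrated into
# 389 minutes of complete failure -> 389 measured downtime minutes.
concentrated = [REQS_PER_MIN] * 389 + [0] * (MINUTES - 389)

for name, month in (("spread", spread), ("concentrated", concentrated)):
    down = downtime_minutes(month)
    print(f"{name}: {down} downtime minutes, within SLO: {within_slo(down)}")
```

In this model, hundreds of thousands of evenly spread failures never breach the SLO, while the same failures clustered into a few hours blow through the ~21.6-minute budget.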

How Do SLOs Look in the Azure World?

After my shocking experience with the GCP SLOs, I took a quick look at the Azure SLOs. My findings? At first glance, Azure’s availability guarantees fall short of GCP’s. For instance, a basic Azure VM comes with a mere 95% guarantee, quite a contrast to GCP’s 99.9% minimum. Azure customers can push it up to 99.9% with more expensive disks; however, this availability guarantee is still below GCP’s best single-VM option (99.95%). Still, the significance of these numbers is limited.

The first key distinction between GCP and Azure: Azure does not subscribe to the “an outage of less than one minute does not matter” doctrine. Such short outages might not be a big deal for batch workloads, but they are a significant concern for interactive workloads, where a 10-second outage can mean that a customer abandons a spontaneous purchase in an online shop.

The second fundamental difference: customers can boost SLOs to up to 12 “9”s (again, without any one-minute trick) by deploying the same VM in multiple locations. For such multi-location deployments, GCP promises only 99.99%.

When looking at a sample PaaS service such as Azure Function, Azure and GCP seem on par: Azure promises 99.95% for Azure Function, and GCP the same for its Cloud Function service. But again, Azure distinguishes itself from GCP by aligning its SLOs with what one might consider “naive customer expectations.” Azure does not ignore 1% or 10% error rates in its SLOs, nor does Microsoft calculate with a “one-minute-outage-is-fine” trick. Azure meticulously tracks how many service invocations fail for Azure Function. We see the same pattern for other SLOs, e.g., Azure Function App: the SLO looks at the complete period and calculates the percentage of time the service was available.

So, for me, it was really intriguing to observe the varying approaches cloud providers take to SLOs, even though most customers may not notice any differences in reality.
