Workload Availability on GCP: When 99.99% is less than customers expect

The cloud runs around the clock, with highly automated maintenance and hardware that is more dependable than ever. So, can application owners and CIOs assume that everything runs 100% perfectly, e.g., on the Google Cloud Platform? The quick answer is “no” – and delving into GCP’s Service Level Objectives (SLOs) for sample core workload services – Compute Engine, Cloud Function, Cloud Run, and App Engine – is an eye-opener for managers. GCP’s SLOs are brutally clear, but they do not match naive customer expectations.

GCP’s SLOs for Compute Engine

Compute Engine is Google Cloud Platform’s (GCP) name for Virtual Machines (VMs). GCP guarantees an impressive monthly availability for single VMs of 99.9%, with some VM configurations even achieving 99.95% (memory-optimized, premium tier). By deploying premium tier VMs in multiple zones, the availability can reach 99.99%. These numbers translate to a maximum monthly downtime of roughly 43 minutes (99.9%), 22 minutes (99.95%), or 4.4 minutes (99.99%).
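For illustration, here is a minimal Python sketch of how those downtime budgets follow from the availability percentages. The function name and the 30-day-month assumption are mine, not GCP’s:

```python
# Translate a monthly availability SLO into a downtime budget,
# assuming a 30-day month (43,200 minutes).
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def max_monthly_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime (in minutes) allowed by a monthly availability SLO."""
    return MINUTES_PER_MONTH * (1 - availability_percent / 100)

for slo in (99.9, 99.95, 99.99):
    print(f"{slo}% -> {max_monthly_downtime_minutes(slo):.1f} min/month")
```

Running this prints roughly 43.2, 21.6, and 4.3 minutes per month for the three SLO tiers.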

Google’s SLO fine print, however, is not impressive but shocking. An issue must persist for more than one minute to count against the SLO – but a one-minute outage already equals roughly 0.002% of monthly downtime, which adds up quickly and makes the value of a 99.99% SLO questionable. Furthermore, downtime is defined as losing external connectivity or access to the persistent disk(s). What does that mean for crashed or terminated VMs?

First, GCP does not charge for them (yippie!). Second, to my best understanding, they are irrelevant for the SLO (hmmm). Third, no VM means no application execution. So, crashed VMs can be an issue for IT managers who must commit to uptime guarantees at the application level toward their business stakeholders. Thus, when looking at GCP’s VM SLOs, IT departments must architect for high availability at the application level to handle shorter outages, even when running in the cloud.
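The effect of the one-minute threshold mentioned above can be sketched in a few lines of Python. The numbers and function names are illustrative assumptions of mine; the point is that outages just under the threshold never show up in the measured downtime:

```python
# Sketch of the "one-minute threshold" effect: outages shorter than the
# threshold do not count toward the measured SLO downtime, even though
# users experience them. Numbers and names are illustrative.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def measured_downtime_minutes(outages_seconds, threshold_s=60):
    """Only outages lasting longer than threshold_s count toward the SLO."""
    return sum(s for s in outages_seconds if s > threshold_s) / 60

outages = [59] * 10                    # ten just-under-a-minute outages
experienced = sum(outages) / 60        # what users actually suffered
measured = measured_downtime_minutes(outages)

print(f"experienced: {experienced:.1f} min "
      f"({experienced / MINUTES_PER_MONTH:.4%} of the month)")
print(f"counted toward the SLO: {measured:.1f} min")
```

Ten 59-second outages add up to almost ten minutes of real unavailability, yet zero minutes count against the SLO.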

GCP’s SLOs for Cloud Function, Cloud Run, and App Engine

Let’s move from IaaS to PaaS workloads. GCP Cloud Function eases the execution of short code snippets. It is especially beneficial for “gluing” services together, e.g., when adding a new object to a Cloud Storage bucket should trigger some processing such as image recognition or scanning a text document for hate speech. Cloud Run enables engineers to create functions executing containerized application code; a (web) service invocation then triggers their execution.

Both GCP services, Cloud Function and Cloud Run, come with a 99.95% availability guarantee. But, again, this GCP SLO comes with surprising side constraints. The definition of Cloud Function downtime is:

  • At least 10% of the requests must fail (only GCP-induced issues count).
  • This situation must last for at least one minute.
  • The measurement period must contain at least 100 service requests.

The definition of Cloud Run downtime is the following:

  • Error rate of at least 1% (HTTP 5xx status responses)
  • This situation must last for at least one minute.
  • The measurement period must contain at least 100 service requests.

So, the GCP 99.95% SLO does not imply that Google guarantees that 99.95% of the service requests succeed! If (close to) 1% of Cloud Run invocations – or 10% of Cloud Function invocations – fail in a month, this might still be within the GCP 99.95% SLO, depending on the temporal distribution of the failed requests.
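A small simulation makes this concrete. It is a hedged sketch, not GCP’s actual measurement method: I model the Cloud Run-style rule from above as “a minute counts as downtime if it has at least 100 requests and an error rate of at least 1%,” and I assume a constant load of 1,000 requests per minute. All names and numbers are mine:

```python
# Why the same overall error volume can land inside or outside a 99.95% SLO,
# depending on how the failures cluster in time. Illustrative model only.
MINUTES = 30 * 24 * 60   # 43,200 minutes in a 30-day month
REQS_PER_MIN = 1000      # assumed constant load

def downtime_minutes(errors_per_minute):
    """Count minutes with >= 100 requests and an error rate >= 1%."""
    return sum(
        1 for errors in errors_per_minute
        if REQS_PER_MIN >= 100                  # "at least 100 requests" rule
        and errors / REQS_PER_MIN >= 0.01       # "error rate >= 1%" rule
    )

def within_slo(down_minutes, slo=99.95):
    return down_minutes <= MINUTES * (1 - slo / 100)   # budget: ~21.6 min

# Scenario A: ~0.9% of ALL requests fail, spread evenly (9 errors/minute).
# Every minute stays below the 1% threshold -> zero measured downtime.
spread = [9] * MINUTES

# Scenario B: roughly the same total error count, but concentrated into
# 389 minutes of complete failure -> 389 measured downtime minutes.
concentrated = [REQS_PER_MIN] * 389 + [0] * (MINUTES - 389)

for name, month in (("spread", spread), ("concentrated", concentrated)):
    down = downtime_minutes(month)
    print(f"{name}: {down} downtime minutes, within SLO: {within_slo(down)}")
```

In this model, hundreds of thousands of evenly spread failures never breach the SLO, while the same failures clustered into a few hours blow through the ~21.6-minute budget.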

How Do SLOs Look in the Azure World?

After my shocking experience with the GCP SLOs, I took a quick look at the Azure SLOs. My findings? At first glance, Azure’s availability guarantees fall short of GCP’s. For instance, a basic Azure VM comes with a mere 95% guarantee, quite a contrast to GCP’s 99.9% minimum. Azure customers can push it up to 99.9% with more expensive disks; however, this availability guarantee is still below GCP’s best single-VM option (99.95%). Still, the significance of these numbers is limited.

The first key distinction between GCP and Azure: Azure does not subscribe to the “an outage of less than one minute does not matter” doctrine. Such short outages might not be a big deal for batch workloads, but they are a significant concern for interactive workloads, where a 10-second outage can mean that a customer abandons a spontaneous purchase in an online shop.

The second fundamental difference: customers can boost SLOs to up to 12 “9”s (again, without any one-minute trick) by deploying the same VM in multiple locations. For such multi-location deployments, GCP promises only 99.99%.

When looking at a sample PaaS service such as Azure Function, Azure and GCP seem on par: Azure promises 99.95% for Azure Function, and GCP the same for its Cloud Function service. But again, Azure distinguishes itself from GCP by aligning its SLOs with what one might consider “naive customer expectations.” Azure does not ignore 1% or 10% error rates in its SLOs, nor does Microsoft calculate with a “one-minute-outage-is-fine” trick. Azure meticulously tracks how many service invocations fail for Azure Function. We see the same pattern for other SLOs, e.g., Azure Function App: the SLO looks at the complete period and calculates the percentage of time the service was available.

So, for me, it was really intriguing to observe the varying approaches cloud providers take to SLOs, even though most customers may not notice any differences in reality.
