The Real Cause of the Crowdstrike Outage
TLDR: The Crowdstrike outage was seemingly caused by a faulty config update that hit a corner case not handled gracefully (or at all) in system driver code. This in turn triggered a null pointer exception inside the driver - and, as a result, the whole operating system crashed with the proverbial BSOD.
It's the kind of accident that's always waiting to happen. Bugs are inevitable, and faulty configs will get applied.
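To make that failure mode concrete, here's a minimal user-space C sketch of the pattern described in the TLDR. This is not CrowdStrike's actual driver code - the rule names, the lookup function and the config shape are all invented for illustration. The point is only how an unvalidated config entry becomes an unhandled null pointer, which in kernel mode takes the whole OS down with it.

```c
/* Hypothetical illustration of the bug class described in the TLDR --
 * NOT the actual driver code. A config update references an entry the
 * lookup doesn't recognize; the unsafe path dereferences the NULL result,
 * the guarded path treats the corner case gracefully. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;
    const char *pattern;
} rule_t;

/* The rules the code actually knows about. */
static const rule_t known_rules[] = {
    { "rule-a", "pattern-a" },
    { "rule-b", "pattern-b" },
};

/* Returns NULL for anything a faulty config update references
 * that the code was never taught to handle -- the corner case. */
static const rule_t *find_rule(const char *name) {
    for (size_t i = 0; i < sizeof known_rules / sizeof known_rules[0]; i++) {
        if (strcmp(known_rules[i].name, name) == 0)
            return &known_rules[i];
    }
    return NULL;
}

/* Trusts the config blindly: on the corner case this dereferences NULL.
 * In user space that's a segfault; inside a kernel driver it's a BSOD. */
static void apply_rule_unsafe(const char *name) {
    const rule_t *r = find_rule(name);
    printf("%s -> %s\n", name, r->pattern);
}

/* Handles the corner case gracefully: validate, log, degrade -- don't crash. */
static int apply_rule_guarded(const char *name) {
    const rule_t *r = find_rule(name);
    if (r == NULL) {
        fprintf(stderr, "skipping unknown config entry: %s\n", name);
        return -1;
    }
    printf("%s -> %s\n", name, r->pattern);
    return 0;
}

int main(void) {
    apply_rule_guarded("rule-a");   /* fine */
    apply_rule_guarded("rule-zzz"); /* corner case, handled */
    apply_rule_unsafe("rule-zzz");  /* corner case, unhandled: crash */
    return 0;
}
```

Of course, adding that one missing check is the local fix. The rest of this post is about why the systemic question - how such an input reached millions of machines at once - matters more.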
While it's important to pinpoint the problem - the cause is systemic. And the only way to treat it is by continuously applying systems thinking to the delivery process.
Finding The Root Cause
When a failure happens, we will always want to point a finger.
- It was this hole that we now need to close!
- It was this person we now need to punish/replace!
It's so tempting, almost irresistible, to leverage someone else's failure to prove your point. Especially when your ego hurts.
But the truth is - if it had been possible to prevent the Crowdstrike outage, it would've been prevented. The fault doesn't lie with the engineer who wrote the code. Nor with the claim that "in Rust this couldn't have happened!" Nor even with the update process that should've been "staggered" (I wonder who came up with that word?!).
The failure, like all other failures, is systemic. And it can only be solved at the systemic level.
Let me explain what I mean.
Speed and Stability
Software delivery is a process of moving information. Like every process it has two main interdependent characteristics - speed and stability. There's a certain speed at which the process is at its most stable. Going any slower or any faster than that speed brings risk. That's when we - the engineers - need to add or remove feedback loops in order to either accelerate or balance the process.
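As a hedged illustration of what such a balancing feedback loop might look like, here's a toy staged-rollout sketch in C. The stage sizes, the 1% error threshold and the telemetry function are all invented for the example - they stand in for whatever signal your own delivery process feeds back.

```c
/* Toy sketch of a balancing feedback loop in a staggered rollout --
 * hypothetical stages, threshold and telemetry, not any vendor's pipeline.
 * The rollout only keeps accelerating while the feedback signal stays healthy. */
#include <stddef.h>
#include <stdio.h>

/* Hypothetical feedback signal: failure rate reported by the hosts
 * already updated. In a real pipeline this would come from telemetry. */
static double observed_error_rate(int percent_rolled_out) {
    (void)percent_rolled_out;
    return 0.001; /* placeholder value for the sketch */
}

int main(void) {
    const int stages[] = { 1, 5, 25, 50, 100 }; /* staggered rings, in % of fleet */
    const double threshold = 0.01;              /* halt above 1% errors */

    for (size_t i = 0; i < sizeof stages / sizeof stages[0]; i++) {
        double err = observed_error_rate(stages[i]);
        if (err > threshold) {
            printf("halting at %d%%: error rate %.3f above %.3f\n",
                   stages[i], err, threshold);
            return 1; /* the loop deliberately trades speed for stability */
        }
        printf("stage %d%% healthy (error rate %.3f), expanding\n",
               stages[i], err);
    }
    printf("rollout complete\n");
    return 0;
}
```

The loop is trivial, but it makes the trade-off explicit: every check slows the update down, and removing it buys speed at the cost of exactly the kind of blast radius we just witnessed.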
In today's business reality, engineers are far more focused on acceleration, while the balancing activities are usually added as an afterthought.
Just look at the Platform Engineering discipline - most of its focus today seems to be on developer productivity, not on developer safety. When the glorified startup mentality of moving fast and breaking things seeps into enterprise software, bad things are bound to happen.
But the desire to go faster is not the actual problem! It's just a business requirement. And a justified one. Security breaches must be prevented in time!
The real reason for this and other major outages (more will inevitably occur) is this endless hunt for the root cause - for that component or that person we can fix. Because the discovery of a local problem calls for a local solution, which means that on the systemic level the problem prevails.
Thinking in Systems Yet Again
And the correct way of dealing with it is by applying systems thinking to our delivery process.
By realizing that each time we remove a bottleneck, we also create risk that we now need to mitigate. That, somewhat paradoxically, each layer of security, policy enforcement or automation we add contributes to the overall risk.
So yes - by all means - rewrite your code in memory-safe languages, ensure 100% test coverage, hire the best engineers and cater to their psychological safety, introduce guardrails into your update processes, limit access to the kernel... These are all valid steps towards system reliability. But let's not forget that they all add stress, require investment and increase the overall complexity. Or in other words - the smarter your platform, the more of a potential bottleneck it becomes.
Systems thinking requires a continuous evaluation of risks, opportunities and constraints. It also calls for what Dr. Deming called Profound Knowledge - i.e. understanding the mechanics, the semantics, the statistics and, finally, the psychology of the humans doing the work. After all, humans are the ones who create and consume software, humans are the ones who decide if it's urgent to push that update and how to roll it out, and humans are the ones impacted by the outage.
So let me finish this post by expressing my deepest empathy for everyone who missed a flight or a medical procedure, and for all the engineers who were involved in rolling out this update and are now under the stress of mitigating its fallout. When SRE fails, #hugops is the way to get back on track.
And then we can go back to reevaluating our platform strategy - find the bottleneck, elevate the constraint, mitigate the risk, measure, optimize, rinse and repeat.
Smooth delivery to you all!
#platformengineering #devops #security