The Real Cause of the CrowdStrike Outage

[Image generated by MCP (math chaos patterns) software: https://www.mathchaospatterns.com/mcp]


TLDR: The CrowdStrike outage was seemingly caused by a faulty config update that hit a corner case which was not handled gracefully (or at all) in the driver code. This in turn triggered an out-of-bounds memory read (widely reported at the time as a null pointer dereference) inside a kernel driver, and as a result the whole operating system crashed with the proverbial BSOD.
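To make that corner case concrete, here is a minimal, purely illustrative Rust sketch. The function names and the field count are invented, and none of this resembles CrowdStrike's actual driver code. The "ungraceful" version blindly indexes into a parsed config record and panics on a malformed update, which is roughly what an unhandled invalid read does to a kernel driver, except that in kernel mode it takes the whole OS down with it. The "graceful" version treats the missing field as an expected error:

```rust
// Purely illustrative; not CrowdStrike code. In a real kernel driver the
// failure would be an invalid memory read, not a Rust panic.

/// "Ungraceful": assumes field 20 always exists. A malformed config update
/// panics here; in kernel-mode driver code the equivalent mistake crashes the OS.
fn threat_pattern_ungraceful(fields: &[&str]) -> String {
    fields[20].to_string() // index out of bounds -> panic
}

/// "Graceful": the corner case (missing field) is a handled error, so a bad
/// config update is rejected and the system keeps running.
fn threat_pattern_graceful(fields: &[&str]) -> Result<String, String> {
    fields
        .get(20)
        .map(|f| f.to_string())
        .ok_or_else(|| format!("config has {} fields, expected at least 21", fields.len()))
}

fn main() {
    let faulty_update: Vec<&str> = vec!["sig"; 5]; // update shipped with too few fields

    match threat_pattern_graceful(&faulty_update) {
        Ok(pattern) => println!("loaded pattern: {pattern}"),
        Err(e) => eprintln!("rejecting config update: {e}"),
    }

    // Uncommenting this line crashes the process, the userland analogue of the BSOD:
    // let _ = threat_pattern_ungraceful(&faulty_update);
}
```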

It's the kind of accident that is always waiting to happen. Bugs are inevitable, and faulty configs will get applied.

While it's important to pinpoint the problem, the cause is systemic. And the only way to treat it is by continuously applying systems thinking to the delivery process.

Finding The Root Cause

When a failure happens, we will always want to point a finger.

- It was this hole that we now need to close!

- It was this person we now need to punish/replace!

It's so tempting, almost irresistible, to leverage someone else's failure in order to prove your point. Especially when your ego hurts.

But the truth is that if it had been possible to prevent the CrowdStrike outage, it would have been prevented. The fault is not with the engineer who wrote the code. Nor is it that "in Rust this couldn't have happened!" And it's not even in the update process that should have been "staggered" (I wonder who came up with that word?!)

The failure, like all other failures, is systemic. And it can only be solved at the systemic level.

Let me explain what I mean.

Speed and Stability

Software delivery is a process of moving information. Like every process, it has two main interdependent characteristics: speed and stability. There is a certain speed at which the process is at its most stable; going any slower or any faster than that brings risk. That's when we, the engineers, need to add or remove feedback loops in order to either accelerate or balance the process.
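As an illustration of what a balancing feedback loop can look like in a delivery pipeline, here is a small Rust sketch of a staged rollout that pauses between waves, reads a failure-rate signal, and halts the moment the signal degrades. All names, numbers, and the telemetry function are invented for the example:

```rust
use std::{thread, time::Duration};

/// Hypothetical telemetry signal: fraction of hosts in a wave that went
/// unhealthy after receiving the update. Hard-coded here for the sketch.
fn failure_rate_for_wave(_wave: &[&str]) -> f64 {
    0.001
}

/// A balancing feedback loop: ship to a small wave, wait for feedback,
/// and stop the rollout the moment the signal crosses the threshold.
fn staged_rollout(hosts: &[&str], wave_size: usize, max_failure_rate: f64) {
    for (i, wave) in hosts.chunks(wave_size).enumerate() {
        println!("wave {i}: updating {} hosts", wave.len());

        // Bake time: give the feedback a chance to arrive before the next wave.
        thread::sleep(Duration::from_secs(1));

        let rate = failure_rate_for_wave(wave);
        if rate > max_failure_rate {
            eprintln!("wave {i}: failure rate {rate} exceeds {max_failure_rate}, halting");
            return; // stability wins over raw speed
        }
    }
    println!("rollout complete");
}

fn main() {
    let hosts: Vec<&str> = vec!["host"; 20]; // placeholder fleet
    staged_rollout(&hosts, 5, 0.01);
}
```

The loop deliberately trades some speed (the bake time between waves) for stability; removing it makes the rollout faster and the blast radius of a faulty update much larger.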

In modern business reality, engineers are much more focused on acceleration, while the balancing activities are usually added as an afterthought.

Just look at the Platform Engineering discipline: most of its focus today seems to be on developer productivity, not on developer safety. When the glorified startup mentality of moving fast and breaking stuff seeps into enterprise software, bad things are waiting to happen.

But the desire to go faster is not the actual problem! It's just a business requirement. And a justified one. Security breaches must be prevented in time!

The real reason for this and other major outages (which will inevitably occur) is this endless hunt for the root cause - for that one component or that one person we can fix. The discovery of a local problem calls for a local solution, which means that at the systemic level the problem persists.

Thinking in Systems Yet Again

And the correct way of dealing with it is by applying systems thinking to our delivery process.

By realizing that each time we remove a bottleneck, we also create new risk that we then need to mitigate. And that, somewhat paradoxically, each layer of security, policy enforcement or automation that we add contributes to the overall amount of risk.

So yes, by all means: rewrite your code in memory-safe languages, ensure 100% test coverage, hire the best engineers and cater to their psychological safety, introduce guardrails into your update processes, limit access to the kernel... These are all valid steps towards system reliability. But let's not forget that they all add stress, require investment and increase the overall complexity. In other words: the smarter your platform is, the bigger a potential bottleneck it becomes.
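For instance, a guardrail on the update process could be as simple as a pre-release validator that refuses to publish a config unless every record carries the fields the consuming code will later assume exist. The sketch below is hypothetical (the required field count is an assumption for illustration), and it is itself one more component to build, run and maintain, which is exactly the trade-off described above:

```rust
/// Hypothetical guardrail: refuse to publish a config update unless every
/// record carries the fields the consuming driver will later assume exist.
struct ConfigRecord {
    fields: Vec<String>,
}

// Assumed value for the sketch only; not a real CrowdStrike constant.
const REQUIRED_FIELDS: usize = 21;

fn validate_update(records: &[ConfigRecord]) -> Result<(), String> {
    for (i, record) in records.iter().enumerate() {
        if record.fields.len() < REQUIRED_FIELDS {
            return Err(format!(
                "record {} has {} fields, expected at least {}",
                i,
                record.fields.len(),
                REQUIRED_FIELDS
            ));
        }
    }
    Ok(())
}

fn main() {
    // A broken update that would have crashed consumers further down the line.
    let update = vec![ConfigRecord { fields: vec!["x".to_string(); 5] }];

    match validate_update(&update) {
        Ok(()) => println!("update cleared the guardrail, publishing"),
        // One more gate that catches the bad config, and one more thing to maintain.
        Err(e) => eprintln!("blocking release: {e}"),
    }
}
```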

Systems thinking requires a continuous evaluation of risks, opportunities and constraints. It also calls for what Dr. Deming called Profound Knowledge, i.e. understanding the mechanics, the semantics, the statistics and, finally, the psychology of the humans doing the work. After all, humans are the ones who create and consume software, humans are the ones who decide whether it's urgent to push that update and how to roll it out, and humans are the ones impacted by the outage.

So let me finish this post by expressing my deepest empathy for everyone who missed a flight or a medical procedure, and for all the engineers who were involved in rolling out this update and are now under the stress of mitigating its unpleasant outcomes. When SRE fails, #hugops is the way to get back on track.

And then we can go back to reevaluating our platform strategy - find the bottleneck, elevate the constraint, mitigate the risk, measure, optimize, rinse and repeat.

Smooth delivery to you all!

#platformengineering #devops #security


Comments

Bobby Nielsen (Co-founder + CTO @ Wired Relations - The GRC solution), 4 months ago:

I would have loved to spectate the "blameless postmortem".

Yan Vugenfirer (CEO @ Daynix Computing LTD | Technological Leadership), 4 months ago:

Time to use CI for kernel drivers. Ask me how :)
