The Real Cause of the Crowdstrike Outage
TLDR: The Crowdstrike outage was seemingly caused by a faulty config update that hit a corner case not handled gracefully (or at all) in system driver code. This in turn triggered a null pointer exception inside the driver - and, as a result, the whole operating system crashed with the proverbial BSOD.
It's the kind of accident that's always waiting to happen. Bugs are inevitable, and faulty configs will get applied.
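To make that failure mode concrete, here's a minimal user-space C sketch of the pattern described in the TLDR. This is not CrowdStrike's actual driver code - the rule names, the lookup function and the config shape are all invented for illustration. The point is only how an unvalidated config entry becomes an unhandled null pointer, which in kernel mode takes the whole OS down with it.

```c
/* Hypothetical illustration of the bug class described in the TLDR --
 * NOT the actual driver code. A config update references an entry the
 * lookup doesn't recognize; the unsafe path dereferences the NULL result,
 * the guarded path treats the corner case gracefully. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;
    const char *pattern;
} rule_t;

/* The rules the code actually knows about. */
static const rule_t known_rules[] = {
    { "rule-a", "pattern-a" },
    { "rule-b", "pattern-b" },
};

/* Returns NULL for anything a faulty config update references
 * that the code was never taught to handle -- the corner case. */
static const rule_t *find_rule(const char *name) {
    for (size_t i = 0; i < sizeof known_rules / sizeof known_rules[0]; i++) {
        if (strcmp(known_rules[i].name, name) == 0)
            return &known_rules[i];
    }
    return NULL;
}

/* Trusts the config blindly: on the corner case this dereferences NULL.
 * In user space that's a segfault; inside a kernel driver it's a BSOD. */
static void apply_rule_unsafe(const char *name) {
    const rule_t *r = find_rule(name);
    printf("%s -> %s\n", name, r->pattern);
}

/* Handles the corner case gracefully: validate, log, degrade -- don't crash. */
static int apply_rule_guarded(const char *name) {
    const rule_t *r = find_rule(name);
    if (r == NULL) {
        fprintf(stderr, "skipping unknown config entry: %s\n", name);
        return -1;
    }
    printf("%s -> %s\n", name, r->pattern);
    return 0;
}

int main(void) {
    apply_rule_guarded("rule-a");   /* fine */
    apply_rule_guarded("rule-zzz"); /* corner case, handled */
    apply_rule_unsafe("rule-zzz");  /* corner case, unhandled: crash */
    return 0;
}
```

Of course, adding that one missing check is the local fix. The rest of this post is about why the systemic question - how such an input reached millions of machines at once - matters more.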
While it's important to pinpoint the problem - the cause is systemic. And the only way to treat it is by continuously applying systems thinking to the delivery process.
Finding The Root Cause
When a failure happens, we will always want to point a finger.
- It was this hole that we now need to close!
- It was this person we now need to punish/replace!
It's so tempting, almost irresistible, to leverage someone else's failure to prove your point. Especially when your ego hurts.
But the truth is - if it had been possible to prevent the Crowdstrike outage, it would've been prevented. The fault doesn't lie with the engineer who wrote the code. Nor with the claim that "in Rust this couldn't have happened!" Nor even with the update process that should've been "staggered" (I wonder who came up with that word?!).
The failure, like all other failures, is systemic. And it can only be solved at the systemic level.
Let me explain what I mean.
Speed and Stability
Software delivery is a process of moving information. Like every process it has two main interdependent characteristics - speed and stability. There's a certain speed at which the process is at its most stable. Going any slower or any faster than that speed brings risk. That's when we - the engineers - need to add or remove feedback loops in order to either accelerate or balance the process.
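As a hedged illustration of what such a balancing feedback loop might look like, here's a toy staged-rollout sketch in C. The stage sizes, the 1% error threshold and the telemetry function are all invented for the example - they stand in for whatever signal your own delivery process feeds back.

```c
/* Toy sketch of a balancing feedback loop in a staggered rollout --
 * hypothetical stages, threshold and telemetry, not any vendor's pipeline.
 * The rollout only keeps accelerating while the feedback signal stays healthy. */
#include <stddef.h>
#include <stdio.h>

/* Hypothetical feedback signal: failure rate reported by the hosts
 * already updated. In a real pipeline this would come from telemetry. */
static double observed_error_rate(int percent_rolled_out) {
    (void)percent_rolled_out;
    return 0.001; /* placeholder value for the sketch */
}

int main(void) {
    const int stages[] = { 1, 5, 25, 50, 100 }; /* staggered rings, in % of fleet */
    const double threshold = 0.01;              /* halt above 1% errors */

    for (size_t i = 0; i < sizeof stages / sizeof stages[0]; i++) {
        double err = observed_error_rate(stages[i]);
        if (err > threshold) {
            printf("halting at %d%%: error rate %.3f above %.3f\n",
                   stages[i], err, threshold);
            return 1; /* the loop deliberately trades speed for stability */
        }
        printf("stage %d%% healthy (error rate %.3f), expanding\n",
               stages[i], err);
    }
    printf("rollout complete\n");
    return 0;
}
```

The loop is trivial, but it makes the trade-off explicit: every check slows the update down, and removing it buys speed at the cost of exactly the kind of blast radius we just witnessed.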
In today's business reality, engineers are far more focused on acceleration, while the balancing activities are usually added as an afterthought.
Just look at the Platform Engineering discipline - most of its focus today seems to be on developer productivity, not on developer safety. When the glorified startup mentality of moving fast and breaking things seeps into enterprise software, bad things are bound to happen.
But the desire to go faster is not the actual problem! It's just a business requirement. And a justified one. Security breaches must be prevented in time!
The real reason for this and other major outages (more will inevitably occur) is this endless hunt for the root cause - for that component or that person we can fix. Because the discovery of a local problem calls for a local solution, which means that on the systemic level the problem prevails.
Thinking in Systems Yet Again
And the correct way of dealing with it is by applying systems thinking to our delivery process.
By realizing that each time we remove a bottleneck, we also create risk that we now need to mitigate. That, somewhat paradoxically, each layer of security, policy enforcement or automation we add contributes to the overall risk.
So yes - by all means - rewrite your code in memory-safe languages, ensure 100% test coverage, hire the best engineers and cater to their psychological safety, introduce guardrails into your update processes, limit access to the kernel... These are all valid steps towards system reliability. But let's not forget that they all add stress, require investment and increase the overall complexity. Or in other words - the smarter your platform, the more of a potential bottleneck it becomes.
Systems thinking requires a continuous evaluation of risks, opportunities and constraints. It also calls for what Dr. Deming called Profound Knowledge - i.e. understanding the mechanics, the semantics, the statistics and, finally, the psychology of the humans doing the work. After all, humans are the ones who create and consume software, humans are the ones who decide if it's urgent to push that update and how to roll it out, and humans are the ones impacted by the outage.
So let me finish this post by expressing my deepest empathy for everyone who missed a flight or a medical procedure, and for all the engineers who were involved in rolling out this update and are now under the stress of mitigating its fallout. When SRE fails, #hugops is the way to get back on track.
And then we can go back to reevaluating our platform strategy - find the bottleneck, elevate the constraint, mitigate the risk, measure, optimize, rinse and repeat.
Smooth delivery to you all!
#platformengineering #devops #security