EP17 - How a Tiny Bug Took Down the World
Michele Brissoni
I help PE/VC investors & CTOs fix underperforming IT assets to unlock exponential ROI | Behavioral Engineering Expert | Inventor of the SW Craftsmanship Dojo | Trusted by IBM, ZF, NS & top investors
Friday 19th of July 2024 - The Wake-Up Call day!
Imagine waking up to a world where a single software glitch causes estimated economic losses of $10 billion, with $5.4 billion lost by Fortune 500 companies alone. Unfortunately, this is reality, not a nightmare. This exact scenario unfolded with the CrowdStrike outage, which hit businesses globally. This disaster isn’t just a wake-up call for IT departments but a siren for all industrial leaders. The damage is likely underestimated. IT’s influence on our global economy, political stability, and safety is colossal. We’re in the era of Industry 5.0, while many are still catching up with Industry 4.0 or even 3.0 paradigms. IT forms the backbone of every industrial and civil sector.
Just consider the catastrophic potential: a bug like this could lead to planes going off course, nuclear reactors malfunctioning, or critical infrastructure collapsing. The way we create software, middleware, and hardware has never been more critical.
The CrowdStrike incident is a stark reminder that robust engineering practices, especially in SW development, are no longer negotiable!
The Tech Meltdown Nobody Saw Coming
Picture this: it’s a regular Friday, and suddenly, Windows machines globally start dropping like flies. Cue the dreaded Blue Screen of Death (BSOD). The culprit? A faulty update to CrowdStrike’s Falcon Sensor that triggered an out-of-bounds memory read. The crashes traced back to an undetected error in an InterProcessCommunication (IPC) Template Instance, which made the sensor read memory beyond the bounds of its input data. This wasn’t just a minor oops: it was a global meltdown that required manual intervention! Millions of machines needed restarting and fixing by hand, like in the old days.
CrowdStrike updated us in real time, but as an engineer, I was disappointed with the lack of technical details. I get that they need to protect their IP, but come on, we’re talking about a global disaster here! How did it get to this point? Why wasn’t it noticed? Where’s the guilty code, and where are the automated tests?
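Since the real code is not public, here is a minimal sketch of this class of failure. It is not CrowdStrike’s implementation: the names `TemplateInstance`, `field_index`, `validate` and `load_rules` are hypothetical, and the sketch only assumes that a content rule points at one of the input fields the sensor supplies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical content rule: names the input field it wants to inspect."""
    name: str
    field_index: int


def read_field_unchecked(inputs: list[str], rule: TemplateInstance) -> str:
    # Trusts the rule blindly. In Python an index past the end raises
    # IndexError; in native kernel-mode code the same mistake reads invalid
    # memory and can crash the whole machine.
    return inputs[rule.field_index]


def validate(rule: TemplateInstance, supplied_inputs: int) -> None:
    """Reject a rule that points outside the inputs the caller supplies."""
    if not 0 <= rule.field_index < supplied_inputs:
        raise ValueError(
            f"{rule.name}: field_index {rule.field_index} is outside "
            f"0..{supplied_inputs - 1}"
        )


def load_rules(
    rules: list[TemplateInstance], supplied_inputs: int
) -> list[TemplateInstance]:
    # Validate every rule before any of them reaches the interpreter,
    # so a bad update is refused as a whole instead of crashing at runtime.
    for rule in rules:
        validate(rule, supplied_inputs)
    return rules
```

With 20 supplied inputs, a rule with `field_index=19` loads fine, while a rule with `field_index=20` is rejected at load time. The point of the design is to fail in the validator and in the tests, not on the customer’s machine.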
The story and the technical details available so far seem designed to keep people less alarmed than they should be. One of the principles of science and engineering is the reproducibility of an experiment. Without all the technical details of such a disaster, how can we be sure the remedy was found and that nothing worse will happen in the future? We simply can’t. So, we either sleep soundly until the next global disaster, or we roll up our sleeves and turn this industry into a properly engineered sector.
We need fewer "software developers" and more "software engineers" who know the principles of both engineering and craftsmanship.
The Plot Thickens - My Personal Analysis
Lately, everyone has turned into a CSI detective, pointing fingers at CrowdStrike. I won’t regurgitate the wild theories floating around. Instead, I offer my perspective, built on over three decades in software engineering, including high-performance environments like Formula 1 and MotoGP.
From their public GitHub repositories and my Key Behavioral Indicators analysis, it seems that CrowdStrike’s dev team skips crucial engineering steps. Proper Test-Driven Development (TDD) appears to be non-existent. They seem to patch code rather than write tests first, creating a house of cards. Their tests barely pass coverage checks and catch only "trivial" issues. Major engineering practices like conventional commits, semantic releases, and mutation testing, and fundamentals like a proper testing pyramid, are missing.
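To show why passing a coverage check proves little, here is a minimal hand-rolled sketch of the idea behind mutation testing. Everything in it is illustrative: the function `in_bounds`, its mutant, and the two test suites are hypothetical, and real mutation tools generate the mutants automatically.

```python
from typing import Callable

Check = Callable[[int, int], bool]


def in_bounds(index: int, length: int) -> bool:
    """Code under test: is `index` a valid position in a sequence of `length`?"""
    return 0 <= index < length


def in_bounds_mutant(index: int, length: int) -> bool:
    """Same function with one operator flipped (`<` became `<=`)."""
    return 0 <= index <= length


def weak_suite(fn: Check) -> bool:
    # Executes every line of the function (full line coverage)
    # but never probes the boundary.
    return fn(0, 3) is True and fn(10, 3) is False


def strong_suite(fn: Check) -> bool:
    # Written test-first from the requirement "index == length is invalid".
    return (
        weak_suite(fn)
        and fn(2, 3) is True
        and fn(3, 3) is False
        and fn(-1, 3) is False
    )


def mutant_survives(suite: Callable[[Check], bool]) -> bool:
    # A mutant survives when the suite passes on both the original
    # and the mutated code, i.e. the tests cannot tell them apart.
    return suite(in_bounds) and suite(in_bounds_mutant)
```

Here `mutant_survives(weak_suite)` is `True`: the suite is green and fully covered, yet blind to an off-by-one bug. `mutant_survives(strong_suite)` is `False`, because the boundary test kills the mutant.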
Acceptance Test-Driven Development (ATDD) and Behavior-Driven Development (BDD)? Nowhere to be found. These practices ensure software meets real-world needs and behaves correctly. Without them, updates are ticking time bombs.
In high-performance environments, these practices are fundamental for reliable and high-performing products. Blaming developers for such disasters is shortsighted. The real issue is systemic—a sick software industry treating development as a cost rather than an essential investment. If this status quo persists, we risk facing catastrophes far worse than rebooting Windows machines or flight cancellations.
CrowdStrike's code reflects a broader industry problem that needs addressing before we face even more severe consequences.
The Real Villain: Global Lack of SW Engineering Discipline
No Test-Driven Development, lipstick DevOps, no refactoring, no clean code principles. Just quick fixes piled on, leading to fragile systems ready to collapse at the first hard hit. This sloppy approach is the real problem in the majority of companies: it’s an industry-wide issue. The rush to deliver at the expense of quality leads to disasters like the one we just faced. This wasn’t just a bad day for CrowdStrike. Airlines, banks, retailers, and even law enforcement were hit hard. The bug fix, meant to be a quick patch, turned into a chain reaction of new bugs and issues. In the interconnected world of Industry 5.0, where IT integrates with human-centric approaches, this structural quality problem can lead to economic disasters worse than COVID!
Technology is evolving at hyperbolic speed. Companies, trying to keep up, take shortcuts, blind to the risks they’re inviting. My perspective aligns with that of many industry leaders.
This fiasco is a wake-up call. In a hyperconnected world reliant on robust IT, we can’t afford shortcuts. Proper engineering practices aren’t optional. They’re essential.
Leaders, We've Got a First-Aid Kit for You
Next time you hear about a tech meltdown, remember: it's not just bad luck. It’s a sign of deeper issues in how we build and maintain our software. Learn from CrowdStrike’s crash and commit to better practices. Their stock fell 30% in the blink of an eye, even before legal trials began. Can your company survive that? Probably not. Is your risk management policy sustainable or just giving you headaches?
Stay sharp, and empower your IT departments to code smart with our BriX Consulting Unicorns Ecosystem. Knowledge is power, and now you know. Your IT departments are unequipped to handle this hyper-connected IT world. Your governance and risk management aren’t ready to navigate such crises. Your products can suddenly become faulty, leading you to shut down.
Don’t be foolish enough to continue with sloppy software engineering in your company!
Jump onboard the BriX Consulting Unicorns Ecosystem, and together, we’ll evolve your organization toward digital excellence!
--
Due to this week’s exceptional events, we’ve temporarily shifted focus to address the urgent wake-up call from the CrowdStrike incident. We’ll return to our deep dive into the Unicorns’ Ecosystem ASAP, once the IT leadership community has fully absorbed and understood the implications of this critical situation. Stay tuned for more insights!
Subscribe to "The Forge of Unicorns" newsletter and stay ahead of the digital game.
YouTube
Spotify