EP17 - How a Tiny Bug Took Down the World

Michele Brissoni

I help PE/VC investors & CTOs fix underperforming IT assets to unlock exponential ROI | Behavioral Engineering Expert | Inventor of the SW Craftsmanship Dojo | Trusted by IBM, ZF, NS & top investors

Published: July 25, 2024

Friday 19th of July 2024 - The Wake-Up Call Day!

Imagine waking up to a world where a single software glitch causes an estimated $10 billion in economic losses, $5.4 billion of it borne by Fortune 500 companies alone. Unfortunately, this is reality, not a nightmare: it is exactly what unfolded with the CrowdStrike outage, which hit businesses around the globe. The disaster isn’t just a wake-up call for IT departments but a siren for all industrial leaders, and the damage is likely underestimated. IT’s influence on our global economy, political stability, and safety is colossal. We’re in the era of Industry 5.0, while many are still catching up to Industry 4.0 or even 3.0 paradigms. IT forms the backbone of every industrial and civil sector.

https://knowhow.distrelec.com/manufacturing/is-your-business-ready-for-industry-5-0/

Just consider the catastrophic potential: a bug like this could lead to planes going off course, nuclear reactors malfunctioning, or critical infrastructure collapsing. The way we create software, middleware, and hardware has never been more critical.

The CrowdStrike incident is a stark reminder that robust engineering practices, especially in software development, are no longer negotiable!

The Tech Meltdown Nobody Saw Coming

Picture this: it’s a regular Friday, and suddenly Windows machines around the globe start dropping like flies. Cue the dreaded Blue Screen of Death (BSOD). The culprit? A faulty update to CrowdStrike’s Falcon Sensor. The crashes were traced to an undetected error in an InterProcessCommunication (IPC) Template Instance, which led to an out-of-bounds memory read: the sensor tried to read memory outside the data it had been given, which is a different failure from running out of memory (OOM). This wasn’t just a minor oops; it was a global meltdown that required manual intervention! Millions of machines had to be restarted and fixed by hand, like in the old days. CrowdStrike updated us in real time, but as an engineer, I was disappointed by the lack of technical details. I get that they need to protect their IP, but come on, we’re talking about a global disaster here! How did it get to this point? Why wasn’t it noticed? Where’s the guilty code, and where are the automated tests?
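To make the failure class concrete, here is a minimal sketch of an out-of-bounds read and the up-front check that prevents it. CrowdStrike has not published the offending code, so every name here (`TemplateInstance`, `validate`, `evaluate`, the field names and counts) is hypothetical and only illustrates the general pattern: a consumer that trusts an index coming from content it never validated. In Python a bad index raises an exception; in kernel-level C or C++ the same mistake reads invalid memory and can take the whole machine down.

```python
from dataclasses import dataclass


class ContentValidationError(Exception):
    """Raised when a content update references data that was never supplied."""


@dataclass(frozen=True)
class TemplateInstance:
    # Hypothetical stand-in for a content update: the indexes of the
    # input fields that this rule wants to inspect.
    field_indexes: tuple[int, ...]


def validate(instance: TemplateInstance, supplied_inputs: int) -> None:
    """Reject the update up front if any index falls outside the inputs."""
    for index in instance.field_indexes:
        if not 0 <= index < supplied_inputs:
            raise ContentValidationError(
                f"field index {index} is outside the {supplied_inputs} supplied inputs"
            )


def evaluate(instance: TemplateInstance, inputs: list[str]) -> list[str]:
    """Read the requested fields only after the whole update is validated."""
    validate(instance, len(inputs))
    return [inputs[i] for i in instance.field_indexes]


inputs = ["pipe_name", "process_id", "user"]   # three inputs supplied
good = TemplateInstance(field_indexes=(0, 2))
bad = TemplateInstance(field_indexes=(0, 3))   # index 3 does not exist

print(evaluate(good, inputs))
try:
    evaluate(bad, inputs)
except ContentValidationError as err:
    print(f"update rejected: {err}")
```

The design point is that validation happens once, before any field is read, so a bad update is refused as a whole instead of failing halfway through evaluation.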

The story and technical details we can gather seem designed to keep people less alarmed than they should be. One of the principles of science and engineering is the reproducibility of an experiment. Without all the technical details of such a disaster, how can we be sure that the remedy was found and that nothing worse will happen in the future? We simply can’t. So we either keep sleeping soundly until the next global disaster, or we roll up our sleeves and turn this industry into a properly engineered sector.

We need fewer "software developers" and more "software engineers" who know the principles of both engineering and craftsmanship.

The Plot Thickens - My Personal Analysis

Lately, everyone has turned into a CSI detective, pointing fingers at CrowdStrike. I won’t regurgitate the wild theories floating around. Instead, I offer my perspective from over three decades in software engineering, including high-performance environments like Formula 1 and MotoGP.

From their public GitHub repositories and my Key Behavioral Indicators analysis, it seems that CrowdStrike's dev team skips crucial engineering steps. Proper Test-Driven Development (TDD) is non-existent. They seem to patch code rather than write tests first, creating a house of cards. Their tests barely pass coverage checks and catch only "trivial" issues. Major engineering practices such as conventional commits, semantic releases, and mutation testing, along with fundamentals like a proper testing pyramid, are missing.
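Why passing a coverage check is not enough can be shown with a toy mutation test. This is a minimal sketch, not a real tool: the function `within_limit`, its hand-written mutant, and both test suites are hypothetical. Mutation testing tools such as mutmut generate mutants like this automatically.

```python
def within_limit(value: int, limit: int) -> bool:
    """Original implementation: True when value is strictly below limit."""
    return value < limit


def within_limit_mutant(value: int, limit: int) -> bool:
    """Hand-written mutant: '<' replaced by '<='."""
    return value <= limit


def weak_suite(fn) -> bool:
    """Executes every line of fn (full line coverage) but skips the boundary."""
    return fn(1, 10) is True and fn(20, 10) is False


def strong_suite(fn) -> bool:
    """Adds the boundary case that tells '<' apart from '<='."""
    return weak_suite(fn) and fn(10, 10) is False


for name, suite in (("weak", weak_suite), ("strong", strong_suite)):
    passes_original = suite(within_limit)
    kills_mutant = not suite(within_limit_mutant)
    print(f"{name}: passes original={passes_original}, kills mutant={kills_mutant}")
```

Both suites cover every line of the function, yet only the one with the boundary case notices that the behavior changed. A surviving mutant is exactly the kind of gap a coverage percentage hides.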

Acceptance Test-Driven Development (ATDD) and Behavior-Driven Development (BDD)? Nowhere to be found. These practices ensure software meets real-world needs and behaves correctly. Without them, updates are ticking time bombs.
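As an illustration of the style, here is a minimal Given/When/Then acceptance scenario written as plain Python functions. The `Sensor` class, its `apply_update` method, and the validation rule are all hypothetical and chosen only to fit this article's topic. BDD frameworks such as behave or pytest-bdd express the same scenarios in Gherkin feature files.

```python
class Sensor:
    """Hypothetical agent that applies content updates."""

    def __init__(self, active_rules: list[str]):
        self.active_rules = list(active_rules)

    def apply_update(self, rules: list[str]) -> bool:
        # Assumed acceptance rule: an update must be non-empty and
        # contain no blank entries, otherwise it is refused.
        if not rules or any(not rule.strip() for rule in rules):
            return False  # keep the last known-good rules
        self.active_rules = list(rules)
        return True


def test_malformed_update_is_rejected():
    # Given a sensor running a known-good rule set
    sensor = Sensor(["block-known-bad-pipe"])
    # When a malformed update arrives
    accepted = sensor.apply_update(["block-new-pipe", "   "])
    # Then the update is refused and the previous rules stay active
    assert accepted is False
    assert sensor.active_rules == ["block-known-bad-pipe"]


def test_valid_update_is_applied():
    # Given a sensor running a known-good rule set
    sensor = Sensor(["block-known-bad-pipe"])
    # When a well-formed update arrives
    accepted = sensor.apply_update(["block-new-pipe"])
    # Then the new rules replace the old ones
    assert accepted is True
    assert sensor.active_rules == ["block-new-pipe"]


test_malformed_update_is_rejected()
test_valid_update_is_applied()
print("acceptance scenarios passed")
```

Each scenario states the expected behavior in business terms first and lets the implementation follow, which is the point of writing the acceptance test before the code.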

In high-performance environments, these practices are fundamental for reliable and high-performing products. Blaming developers for such disasters is shortsighted. The real issue is systemic: a sick software industry treating development as a cost rather than an essential investment. If this status quo persists, we risk facing catastrophes far worse than rebooting Windows machines or flight cancellations.

CrowdStrike's code reflects a broader industry problem that needs addressing before we face even more severe consequences.

The Real Villain: Global Lack of SW Engineering Discipline

No Test-Driven Development, lipstick DevOps, no refactoring, no clean code principles. Just quick fixes piled on, leading to fragile systems ready to collapse at the first hard hit. This sloppy approach is the real problem in the majority of companies; it’s an industry-wide issue. The rush to deliver at the expense of quality leads to disasters like the one we just faced. This wasn’t just a bad day for CrowdStrike. Airlines, banks, retailers, and even law enforcement were hit hard. The bug fix, meant to be a quick patch, turned into a chain reaction of new bugs and issues. In the interconnected world of Industry 5.0, where IT integrates with human-centric approaches, this structural quality problem can lead to economic disasters worse than COVID!

Technology is evolving at hyperbolic speed. Companies trying to keep up take shortcuts, blind to the risks they’re inviting. My perspective aligns with that of many industry leaders. These three feel particularly close to my own:

Kathryn Guarini, former IBM CIO and colleague, in her article "A Tech Crisis," emphasizes the importance of crisis management and robust software practices, one of the critical areas we worked on together in our Big Blue days. She stresses enterprise risk management and chaos engineering for resilient systems.
Gergely Orosz, a prominent tech leader and author of “The Pragmatic Engineer,” in his article "The Biggest Ever Global Outage: Lessons," attributes the crisis to poor software engineering practices, highlighting the need for thorough testing, continuous integration, and better DevOps practices, which aligns perfectly with my point of view.
Dave Farley, co-author of “Continuous Delivery,” is a stalwart in the software engineering community. His work emphasizes the importance of robust engineering practices, continuous integration, and deployment pipelines. Farley’s insights in his video "Software's HUGE Impact On The World | Crowdstrike Global IT Outage" underscore the need for a shift towards a more reliable and sustainable approach to software development.

This fiasco is a wake-up call. In a hyperconnected world reliant on robust IT, we can’t afford shortcuts. Proper engineering practices aren’t optional. They’re essential.

Leaders, We've Got a First-Aid Kit for You

Next time you hear about a tech meltdown, remember: it's not just bad luck. It’s a sign of deeper issues in how we build and maintain our software. Learn from CrowdStrike’s crash and commit to better practices. Their stock fell 30% in the blink of an eye, even before legal trials began. Can your company survive that? Probably not. Is your risk management policy sustainable or just giving you headaches?

Figure: CRWD stock price, which depreciated by roughly 30% after the incident.

Stay sharp, and empower your IT departments to code smart with our BriX Consulting Unicorns Ecosystem. Knowledge is power, and now you know. Your IT departments are unequipped to handle this hyper-connected IT world. Your governance and risk management aren’t ready to navigate such crises. Your products can suddenly become faulty, leading you to shut down.

Don’t be foolish enough to continue with sloppy software engineering in your company!

Jump onboard the BriX Consulting Unicorns Ecosystem, and together, we’ll evolve your organization toward digital excellence!

Due to this week's exceptional events, we've temporarily shifted focus to address the urgent wake-up call from the CrowdStrike incident. We'll return to our deep dive into the Unicorns' Ecosystem ASAP, once the IT leadership community has fully absorbed and understood the implications of this critical situation. Stay tuned for more insights!

Subscribe to "The Forge of Unicorns" newsletter and stay ahead of the digital game.

LinkedIn

YouTube

Spotify

Apple Podcast


