Lessons From The CrowdStrike Outage
Last Friday, CrowdStrike experienced a significant issue with a content update for its Falcon sensor on Windows hosts, causing widespread IT outages affecting customers worldwide, including airlines, banks, shops, and broadcasters.
Surprisingly, I didn't hear about this until one of my employees told me. But that is probably because I use a Mac - and the outage didn't affect me.
So, when my son decided we should write an article about it ... I asked Perplexity.AI, "Can you explain the issue with CrowdStrike?"
Perplexity is an AI chatbot-powered research and conversational search engine that answers queries using natural language predictive text. I've been quite impressed with it. If you haven't tried it yet, I highly recommend checking it out.
Here is an excerpt of what it told me about the incident:
Character of the problem:
Scope of the impact:
CrowdStrike's response:
Important clarifications:
The scope of the outage was surprising.
United, American, and Delta all called complete ground stops. Microsoft was hit. Public displays around the world showed the Blue Screen of Death.
All because CrowdStrike pushed a global update. That patch caused Windows computers running CrowdStrike's Falcon sensor to crash. Even worse, these computers can only be fixed in person by an IT professional. Because the machine is stuck at a Blue Screen of Death, IT can't just remote in to fix it.
It's probably the largest outage in history and has caused untold damage. It affected emergency services in some states and countries.
Even after a patch is issued, it may take days for things to return to normal, as each endpoint requires individual attention, and some systems might have suffered complete failures.
Randall Munroe via XKCD
It's a healthy reminder that our 'robust' infrastructure isn't always so robust ... and that tech consolidation and concentration can have consequences.
While there are a seemingly infinite number of tech companies now, the infrastructure has consolidated into the hands of very few. We need to think about our digital resilience, not just in the systems we run, but in the globally connected systems and in the growing Internet of Things.
Does your business have all of its eggs in one basket? Does it have failsafes in case of an emergency?
As I observe the growing adoption of AI, I notice that people tend to emphasize its capabilities over its potential failures. In our increasingly interconnected and automated world, ensuring business continuity is more crucial than ever.
Bonus Story
CrowdStrike isn't the first - and it certainly won't be the last - to experience this.
Single Points of Failure are pretty common.
AT&T has a windowless building in Tribeca, Manhattan.
It's at 33 Thomas Street, and it's now mostly decommissioned. It is a relic of a time when telecommunication required massive long-distance switching hubs. This building was home to the giant machines (switches) that enabled long-distance calls when they traveled over copper wire.
33 Thomas essentially managed the entire eastern coast. But, in 1991, a combo of a power failure and human error sent this one building into the dark for the first (and, so far, only) time.
And all hell broke loose. Millions of calls were dropped. Transatlantic communication ceased. Air traffic control went down.
Even worse, until that day the FAA didn't even realize this was a single point of failure.
While the building has changed since then, even at one point being an NSA hub, it's a reminder that things like this happen - and the world continues to turn.
It's important to do our best to avoid these disasters, but we must also realize that they are unavoidable and prepare for the inevitable.
Sign up for my Weekly Newsletters here.