Lessons From The CrowdStrike Outage

Last Friday, CrowdStrike experienced a significant issue with a content update for its Falcon sensor on Windows hosts, causing widespread IT outages affecting customers worldwide, including airlines, banks, shops, and broadcasters.

This outage dominated the news. Here are links to stories from Google and Bing search.

Surprisingly, I didn't hear about this until one of my employees told me. But that is probably because I use a Mac - and the outage didn't affect me.

So, when my son decided we should write an article about it ... I asked Perplexity.AI, "Can you explain the issue with CrowdStrike?"

Perplexity is an AI chatbot-powered research and conversational search engine that answers queries using natural language predictive text. I've been quite impressed with it. If you haven't tried it yet, I highly recommend checking it out.

Here is an excerpt of what it told me about the incident:

Character of the problem:

  • A defect was found in a single content update for CrowdStrike's Falcon sensor on Windows hosts.
  • This caused Windows machines to crash, resulting in a "blue screen of death" (BSOD) and getting stuck in a restarting state.

Scope of the impact:

  • Only Windows hosts were affected; Mac and Linux systems were not impacted.
  • The issue affected various organizations globally, including banks, airlines, and other businesses.

CrowdStrike's response:

  • The company quickly identified and isolated the issue.
  • A fix was deployed, and the problematic update was pulled.
  • CrowdStrike is actively working with affected customers to resolve the problem.

Important clarifications:

  • This was not a security incident or cyberattack.

Click here to see the whole perplexity.ai response.

The scope of the outage was surprising.

United, American, and Delta all called complete ground stops. Microsoft was hit. Public displays around the world showed the blue screen of death.

All because CrowdStrike pushed a global update. That update caused Windows computers running CrowdStrike's Falcon sensor to crash. Even worse, these computers typically have to be fixed in person by an IT professional. Because the machine is stuck at a Blue Screen of Death, IT can't just remote in to fix it.

It's probably the largest outage in history and has caused untold damage. It affected emergency services in some states and countries.

Even after a patch is issued, it may take days for things to return to normal, as each endpoint requires individual attention, and some systems might have suffered complete failures.

Randall Munroe via XKCD

It's a healthy reminder that our 'robust' infrastructure isn't always so robust ... and that tech consolidation and concentration can have consequences.

While there are a seemingly infinite number of tech companies now, the infrastructure has consolidated into the hands of very few. We need to think about our digital resilience, not just in the systems we run, but in the globally connected systems and in the growing Internet of Things.

Does your business have all of its eggs in one basket? Does it have failsafes in case of an emergency?

As I observe the growing adoption of AI, I notice that people tend to emphasize its capabilities over its potential failures. In our increasingly interconnected and automated world, ensuring business continuity is more crucial than ever.

Bonus Story

CrowdStrike isn't the first - and it certainly won't be the last - to experience this.

Single Points of Failure are pretty common.

AT&T has a windowless building in Tribeca, Manhattan.


It's at 33 Thomas Street, and it's now mostly decommissioned. It is a relic of a time when telecommunication required massive long-distance switching hubs. This building was home to the giant machines (switches) that enabled long-distance calls when they traveled over copper wire.

33 Thomas managed the entire eastern coast, essentially. But, in 1991, a combination of a power failure and human error sent this one building dark for the first (and, so far, only) time.

And all hell broke loose. Millions of calls were dropped. Transatlantic communication ceased. Air traffic control went down.

Even worse, until that day the FAA didn't even realize this was a single point of failure.

While the building has changed since then, even reportedly serving as an NSA hub at one point, it's a reminder that things like this happen - and the world continues to turn.

It's important to do our best to avoid these disasters, but we must also accept that some of them are unavoidable and prepare for them.


To read more, you can find my blog here or follow me on Twitter here.

Sign up for my Weekly Newsletters here.
