Well, I did not expect this...
Search Google for images of the CrowdStrike BSOD.


When I wrote my post last week on recovering after doomsday, I did not expect reality to catch up this fast and this thoroughly. Now, a few days later, companies are still struggling to recover: I just read that some airlines are still not fully operational.

What happened?

At least one hero seems to have managed to reverse-engineer what caused the outage. On LinkedIn you can read several posts like this one, all pointing to the same cause: a C++ programming error that sends Windows into panic mode and a blue screen.

It makes you wonder. If this really is the cause, any basic QA in the update process should have caught the error and blocked further proliferation. I find it hard to imagine that a company with CrowdStrike's track record falls victim to a beginner's coding error (apparently failing to properly validate pointer values) and such an obvious omission in QA. Simply loading this update should have made it fail even the most basic test. Especially since they claim to push updates like this every few hours. But then again: this is a people business, and people make mistakes.
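To make that concrete, here is a minimal C++ sketch of the class of error those LinkedIn posts describe: dereferencing a pointer or array index taken from a content update without validating it first. The names and data layout are hypothetical illustrations of the pattern, not CrowdStrike's actual code.

    // Illustrative only: the bug pattern is trusting data that came from an update file.
    #include <cstddef>
    #include <cstdio>

    struct ChannelEntry {
        const char* pattern;   // detection pattern loaded from a content update
    };

    // Unsafe: blindly trusts the index and pointer produced by the file parser.
    void print_pattern_unchecked(const ChannelEntry* table, std::size_t idx) {
        std::printf("%s\n", table[idx].pattern);   // crashes if table, idx or pattern is bad
    }

    // The kind of basic validation the author argues QA should have enforced.
    bool print_pattern_checked(const ChannelEntry* table, std::size_t count, std::size_t idx) {
        if (table == nullptr || idx >= count || table[idx].pattern == nullptr) {
            return false;                          // reject the malformed update instead of crashing
        }
        std::printf("%s\n", table[idx].pattern);
        return true;
    }

    int main() {
        const ChannelEntry table[] = { {"pattern-A"}, {"pattern-B"} };
        print_pattern_checked(table, 2, 1);        // prints pattern-B
        print_pattern_checked(table, 2, 7);        // out of range: rejected, no crash
        // print_pattern_unchecked(table, 7);      // undefined behaviour: the failure mode described
        return 0;
    }

In user space a mistake like this takes down one process; in a kernel-mode driver there is nothing above it to catch the fault, so the whole machine blue-screens.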

Some observations

Airport departure and arrival screen failures

What struck me immediately was that departure and arrival screens at airports went blue. These are - imho - IoT systems that I would expect to run a very basic, minimal OS containing only a browser. Not Windows.

But even if Windows is a logical choice, why on earth would you install end-point security software on IoT systems? These systems should not have any working physical interfaces except for the display connector, the network port and a power socket, and network-wise they should sit in a very locked-down part of the infrastructure. Since you know exactly what these systems should do, any deviation from that pattern should lead to an immediate lockout of the system.

Scenarios like this have been presented at Cisco Live for years; Jerome Dolphin gave an excellent presentation on the topic last year in Melbourne, Amsterdam and Las Vegas.
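As a minimal sketch of that "any deviation means lockout" idea, assuming a hypothetical allowlist of back-end hosts (in practice you would enforce this at the network layer with ACLs or ISE policies rather than in application code, but the logic is the same):

    #include <array>
    #include <iostream>
    #include <string>

    // The complete set of destinations a departure-screen or kiosk system ever needs.
    // Hostnames are hypothetical placeholders.
    bool is_allowed(const std::string& destination) {
        static const std::array<std::string, 2> allowlist = {
            "departures.backend.example",
            "ntp.backend.example"
        };
        for (const auto& host : allowlist) {
            if (destination == host) return true;
        }
        return false;
    }

    // Called for every outbound connection attempt; anything off-pattern is a lockout.
    void on_outbound_connection(const std::string& destination) {
        if (!is_allowed(destination)) {
            std::cout << "Unexpected destination " << destination
                      << " - locking this terminal and alerting operations\n";
            // lock_terminal(); alert_soc();   // hypothetical hooks into whatever management layer you run
        }
    }

    int main() {
        on_outbound_connection("departures.backend.example");  // expected traffic: nothing happens
        on_outbound_connection("updates.example.net");         // deviation: triggers the lockout path
    }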

End-user terminals

I'd say the same about the terminals flight attendants work on to check you in, POS terminals, and so on. I would expect these to be kiosk-like systems that can only connect to their respective back-end systems.

There is no need for generic functionality, especially not in open, publicly accessible environments like airport terminals or shops. What you see here is that systems are not stripped down until only the bare minimum is left. Instead you find what I come across at many clients: everything that can be turned on is turned on, complicating the system, introducing unexpected vulnerabilities and opening it up to more attack vectors than strictly necessary. And even worse: it tempts users into doing things they should not do.

Rebooting should be sufficient to get end-user terminals up and running, and PXE boot can provide fresh images when needed. Here, too, solutions have been around for years: think of the default setup for roaming users in Cisco ISE, which quarantines a system if it does not pass the posture checks. A simple reboot should be enough to be back in operation in under an hour.

And here too: any abnormal, heuristically unexpected behavior should lock the system out and send security over to check out who's behind the terminal.

Fat clients

As for the fat clients in offices: any office building has an intercom system for emergency announcements. Just tell people to reboot their machines. The normal quarantine, profiling, posture-checking and subsequent remediation process should suffice. There should never be a need to physically walk to each and every machine for recovery. If you need that, your architecture is missing some essential points on scalability.

Roaming users

Roaming users may present issues. If they are using fat clients, returning to the office and rebooting the system should send them into the posture check and remediation process, so they too can be up and running quickly. Yes, that implies going to the office. But if that is the worst that happens in an outage like this, that is acceptable imho.

Servers

The real challenge is the servers. I do understand the need for end-point protection on these systems, but this outage shows that it will not prevent loss of functionality for the business. As I explained before, servers should not be patched but rebuilt on every change. This requires rethinking your processes around application management and operations.

Modern data center strategies and future-proof implementation of applications - and I do not mean we have to push everything over to containers (just yet) - allow for quick recovery of even complete data centers.

Coffee anyone?

Think about that. What would you need to recover from a failure like this? If you want to find out, contact me for a cup of coffee and we'll explore the options.

Comments

Rafael Magalhães de Andrade

Cloud Backend Engineer at Yokoy | Software Architect, Lead Developer, Senior Software Engineer


Mark S. I was finishing reading your posts last week and, on Friday when I heard about the CrowdStrike issue, they were the first thing I thought about. This new post could have been just a one-line "I told you so", but as usual you have added a lot of nice perspectives. It seems that concepts like "least privilege" or "immutable infrastructure" are something the market still needs to learn. But as you said, the farther you are from the last occurrence, the less you feel the risk. Very glad to read such good content.

Sijbren Beukenkamp

Solution Architect | CCIE #8440 | Cisco Champion | Webex Insider Expert | Cisco Certified Specialist: Collaboration Core


Thinking out loud: shouldn't an operating system be more robust? The Microsoft OS detects the error and even has time to figure out which library caused it. Why not flag the library, reboot and continue in a fail-safe mode or something?

Antoon Huiskens

Consultant at Devoteam


Interestingly, few commenters out there have considered the possibility that CrowdStrike might have knowingly waived full testing, given the need for a critical rollout of a mitigation for a high-severity vulnerability.

Here is the REAL root cause of the CrowdStrike disaster: a Microsoft driver certification bypass. Explained in Spanish here: https://lnkd.in/dqXzUKex Technical details in English: https://lnkd.in/dgu9m_Hq
