CrowdStrike Outage: Who Tests the Tester? When Your Test Tools Go Bad.
One Does Not Simply.


Pushed to Production Without Recognisable Testing

CrowdStrike have published their Preliminary Incident Report (PIR) on the 19th July outages. I try to avoid being sensational, but I'm unsettled by the initial findings.

The update to Windows hosts running CrowdStrike Falcon that impacted approximately 8.5 million machines on 19th July had never been deployed to a real running system before being pushed out.

Let me say that again. The update that hit 8.5 million machines went directly to production. It was not run up on anything real, not staged across production environments, none of that.

Prior to deployment, the update was processed through custom internal CrowdStrike tooling they call a 'Content Validator' and deemed "valid". The PIR states that due to "a bug in the Content Validator, one of the two Template instances passed validation despite containing problematic content data".

I will go as far as saying that this does not constitute recognisable testing by any definition. Given the scope of impact / blast radius of every update, a simple pass through an internally written validation tool simply does not pass the 'sniff test'.
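
To make that concrete, below is a minimal, entirely hypothetical sketch of the kind of structural check a content validator might perform. The field names, counts and structure are invented for illustration and are not taken from CrowdStrike's report; the point is simply that a validator is only as good as the rules it encodes, and one missing rule is enough to wave problematic content through.

```python
# Hypothetical sketch of a "content validator" that checks a template instance
# before release. All field names and counts are invented for illustration and
# do not reflect CrowdStrike's actual formats.

EXPECTED_FIELD_COUNT = 21  # what the downstream consumer assumes it will receive


def validate_template_instance(instance: dict) -> list[str]:
    """Return a list of validation errors; an empty list means 'valid'."""
    errors = []

    if "fields" not in instance:
        errors.append("missing 'fields' section")
        return errors

    fields = instance["fields"]

    # A validator that only checks field *types* will happily pass content the
    # consumer cannot actually handle.
    for i, value in enumerate(fields):
        if not isinstance(value, str):
            errors.append(f"field {i} has unexpected type {type(value).__name__}")

    # The easy-to-miss rule: the consumer indexes fields by position, so a
    # short list means an out-of-bounds read downstream.
    if len(fields) != EXPECTED_FIELD_COUNT:
        errors.append(f"expected {EXPECTED_FIELD_COUNT} fields, got {len(fields)}")

    return errors


if __name__ == "__main__":
    bad_instance = {"fields": ["a"] * 20}  # one field short of what is expected
    print(validate_template_instance(bad_instance))
    # -> ['expected 21 fields, got 20']
```

A check like the field-count rule at the end is exactly the kind of thing that is obvious in hindsight and easy to omit up front, which is why a validator alone cannot stand in for testing against a real system.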

The Bugs

There are at least two bugs here: a bug/error in the update being deployed, which caused the mass outages, and a bug in the 'Content Validator' meant to stop this happening. As I see it, there is at least one more: the fact that invalid content in an update can crash the CrowdStrike kernel driver and blue-screen servers at all.

As is often the case with issues like this, if any one of those bugs had been caught, the incident wouldn’t have happened.
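
To illustrate that third point, here is a deliberately simplified sketch, in Python rather than driver code, of the difference between trusting content and treating it as untrusted input. The data structures are invented; the principle is that a consumer of externally delivered content should validate it and degrade gracefully rather than take the host down.

```python
# Hypothetical sketch: trusting vs defensive handling of delivered content.
# Python stands in for driver code purely for brevity; structures are invented.


def apply_update_trusting(fields: list[str], index: int) -> str:
    # Trusting version: assumes the content is well formed. A short list or a
    # bad index fails at runtime -- the moral equivalent of a kernel crash.
    return fields[index]


def apply_update_defensive(fields: list[str], index: int) -> str | None:
    # Defensive version: validates before use and degrades gracefully,
    # logging and skipping the rule instead of crashing the host.
    if index < 0 or index >= len(fields):
        print(f"rejecting content: index {index} out of range for {len(fields)} fields")
        return None
    return fields[index]


if __name__ == "__main__":
    content = ["a"] * 20
    print(apply_update_defensive(content, 20))  # prints the rejection, returns None
    # apply_update_trusting(content, 20) would raise IndexError
```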

In the interests of balance, CrowdStrike describe sensible testing approaches when they release new software versions and also for the Content Validator itself. However, it seems after a few successful content releases they deem full testing unnecessary and simply run things through the Content Validator.

What are CrowdStrike changing?

The PIR contains a handful of meaningful changes. In my view, it is frankly unacceptable that these things were not already happening; there is nothing revolutionary on this list.

The PIR states they will apply the following to content testing (which presumably means all future updates?):-

· Local developer testing
· Content update and rollback testing
· Stress testing, fuzzing and fault injection
· Stability testing
· Content interface testing
· Additional validation checks to guard against this type of problematic content
· Staged deployment and canary deployment, etc. (a minimal sketch of a canary gate follows this list)
· Providing customers with control over the delivery of content
· Providing content details via release notes, which customers can subscribe to
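
As an illustration of the staged/canary deployment item, here is a minimal sketch of a rollout gate. The ring sizes, crash-rate threshold, update identifier and telemetry source are all assumptions made for illustration; they are not CrowdStrike's actual process.

```python
# Hypothetical sketch of a staged (canary) rollout gate. Ring sizes, the
# crash-rate threshold and the telemetry stand-in are invented for illustration.

import random

ROLLOUT_RINGS = [0.001, 0.01, 0.10, 1.00]   # fraction of the fleet per stage
MAX_CRASH_RATE = 0.001                       # halt if more than 0.1% of hosts crash


def crash_rate_for_ring(ring_fraction: float) -> float:
    """Stand-in for real telemetry: the observed crash rate among hosts
    updated in this ring. Here it is simply simulated."""
    return random.choice([0.0, 0.0, 0.0, 0.02])  # occasionally simulate a bad update


def staged_rollout(update_id: str) -> bool:
    for ring in ROLLOUT_RINGS:
        print(f"deploying {update_id} to {ring:.1%} of hosts")
        observed = crash_rate_for_ring(ring)
        if observed > MAX_CRASH_RATE:
            print(f"halting rollout: crash rate {observed:.2%} exceeds threshold")
            # trigger rollback for the hosts already updated
            return False
    print("rollout complete")
    return True


if __name__ == "__main__":
    staged_rollout("content-update-2024-07-19")  # invented update identifier
```

The design choice that matters is that each ring acts as a tripwire: a bad update is caught while it sits on a fraction of the fleet, not after it has reached every host.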

Third Party Validation

The report also covers two additional actions:-

· Conduct multiple independent third-party security code reviews.
· Conduct an independent review of end-to-end quality processes, from development through to deployment.

My Final Thoughts

The technical causes of what must be the largest single IT outage in history are mundane, as the causes of 'accidents' so often are.

From what I’m seeing currently, to sum up the root cause in a word:-

Overconfidence.

Overconfidence and faith in the self-written Content Validator that said there was no problem with the update before it was deployed to 8.5 million machines.

This overconfidence has cost CrowdStrike dearly in terms of reputational damage and the inevitable litigation to come. It has cost the world dearly in productivity and financial impact, and, I suspect, once you sum up healthcare disruptions, failures of emergency service lines and so on, likely lives too.

The lack of robust testing of any release that has any technical possibility whatsoever of causing an impact is a clear failing on CrowdStrike's part. This was not some procedural mistake; the update that was pushed out followed CrowdStrike's intended release flow.

Step back and ask of your own processes and procedures, “what’s the worst that could happen?” If all the checks and balances fail, what can go wrong? Then ask: do we do enough to mitigate the worst cases?

You can read the full report here:

https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/


Haydi Foster

Senior Business Analyst & Transformation Leader | Driving Change through Process Improvement & Stakeholder Engagement

7 months ago

Crazy how this simple test step did not happen in this day and age of technology.

Ryan Ashton

AFQY + Smartspace.ai + GOVERNANCE4 ~ Fractional Client Engagement | Community Builder | People & Culture | Technology | MC | Mental Health Advocate

7 months ago

Proofreading your own writing... I can't get that right even with the help of squiggly red lines. Conceptually it makes sense, but what is standard practice across the industry for zero-day vulnerabilities? I look forward to reading your report, which no doubt won't have any red squiggly lines... right? ;-)


Hopefully one good thing that comes out of all this is an increased expectation that vendors prove the suitability of their internal processes via 3rd party audits/certification, especially when the software in question is being allowed into ring 0 (kernel).
