CrowdStrike Outage: Who Tests the Tester? When Your Test Tools Go Bad.
One Does Not Simply.


Pushed to Production Without Recognisable Testing

CrowdStrike have published their Preliminary Incident Report (PIR) on the 19th July outages. I try to avoid being sensational, but I'm unsettled by the initial findings.

The update to Windows hosts running CrowdStrike Falcon that impacted approximately 8.5 million machines on 19th July had never been deployed to a real running system before being pushed out.

Let me say that again. The update that hit 8.5 million machines went directly to production. It was not run up on anything real, not staged across production environments, none of that.

Prior to deployment, the update was processed through custom internal CrowdStrike tooling they call a 'Content Validator' and deemed "valid". The PIR states that due to "a bug in the Content Validator, one of the two Template instances passed validation despite containing problematic content data".

I will go as far as saying that this does not constitute recognisable testing by any definition. Given the scope of impact / blast radius of every update, a simple pass through an internally written validation tool simply does not pass the 'sniff test'.
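
To make that concrete, below is a minimal, entirely hypothetical sketch of the kind of structural check a content validator might perform. The field names, counts and structure are invented for illustration and are not taken from CrowdStrike's report; the point is simply that a validator is only as good as the rules it encodes, and one missing rule is enough to wave problematic content through.

```python
# Hypothetical sketch of a "content validator" that checks a template instance
# before release. All field names and counts are invented for illustration and
# do not reflect CrowdStrike's actual formats.

EXPECTED_FIELD_COUNT = 21  # what the downstream consumer assumes it will receive


def validate_template_instance(instance: dict) -> list[str]:
    """Return a list of validation errors; an empty list means 'valid'."""
    errors = []

    if "fields" not in instance:
        errors.append("missing 'fields' section")
        return errors

    fields = instance["fields"]

    # A validator that only checks field *types* will happily pass content the
    # consumer cannot actually handle.
    for i, value in enumerate(fields):
        if not isinstance(value, str):
            errors.append(f"field {i} has unexpected type {type(value).__name__}")

    # The easy-to-miss rule: the consumer indexes fields by position, so a
    # short list means an out-of-bounds read downstream.
    if len(fields) != EXPECTED_FIELD_COUNT:
        errors.append(f"expected {EXPECTED_FIELD_COUNT} fields, got {len(fields)}")

    return errors


if __name__ == "__main__":
    bad_instance = {"fields": ["a"] * 20}  # one field short of what is expected
    print(validate_template_instance(bad_instance))
    # -> ['expected 21 fields, got 20']
```

A check like the field-count rule at the end is exactly the kind of thing that is obvious in hindsight and easy to omit up front, which is why a validator alone cannot stand in for testing against a real system.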

The Bugs

There are at least two bugs here: a bug/error in the update being deployed, which caused the mass outages, and a bug in the 'Content Validator' meant to stop this happening. As I see it, there is at least one more: the fact that invalid content in an update can crash the CrowdStrike kernel driver and blue-screen servers at all.

As is often the case with issues like this, if any one of those bugs had been caught, the incident wouldn’t have happened.
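
To illustrate that third point, here is a deliberately simplified sketch, in Python rather than driver code, of the difference between trusting content and treating it as untrusted input. The data structures are invented; the principle is that a consumer of externally delivered content should validate it and degrade gracefully rather than take the host down.

```python
# Hypothetical sketch: trusting vs defensive handling of delivered content.
# Python stands in for driver code purely for brevity; structures are invented.


def apply_update_trusting(fields: list[str], index: int) -> str:
    # Trusting version: assumes the content is well formed. A short list or a
    # bad index fails at runtime -- the moral equivalent of a kernel crash.
    return fields[index]


def apply_update_defensive(fields: list[str], index: int) -> str | None:
    # Defensive version: validates before use and degrades gracefully,
    # logging and skipping the rule instead of crashing the host.
    if index < 0 or index >= len(fields):
        print(f"rejecting content: index {index} out of range for {len(fields)} fields")
        return None
    return fields[index]


if __name__ == "__main__":
    content = ["a"] * 20
    print(apply_update_defensive(content, 20))  # prints the rejection, returns None
    # apply_update_trusting(content, 20) would raise IndexError
```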

In the interests of balance, CrowdStrike describe sensible testing approaches when they release new software versions and also for the Content Validator itself. However, it seems after a few successful content releases they deem full testing unnecessary and simply run things through the Content Validator.

What are CrowdStrike changing?

The PIR contains a handful of meaningful changes. In my view, it is frankly unacceptable that these things were not already happening; there is nothing revolutionary on this list.

The PIR states they will apply the following to content testing (which presumably means all future updates?):-

· Local developer testing
· Content update and rollback testing
· Stress testing, fuzzing and fault injection
· Stability testing
· Content interface testing
· Additional validation checks to guard against this type of problematic content
· Staged deployment and canary deployment, etc. (a minimal sketch of a canary gate follows this list)
· Providing customers with control over the delivery of content
· Providing content details via release notes, which customers can subscribe to
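
As an illustration of the staged/canary deployment item, here is a minimal sketch of a rollout gate. The ring sizes, crash-rate threshold, update identifier and telemetry source are all assumptions made for illustration; they are not CrowdStrike's actual process.

```python
# Hypothetical sketch of a staged (canary) rollout gate. Ring sizes, the
# crash-rate threshold and the telemetry stand-in are invented for illustration.

import random

ROLLOUT_RINGS = [0.001, 0.01, 0.10, 1.00]   # fraction of the fleet per stage
MAX_CRASH_RATE = 0.001                       # halt if more than 0.1% of hosts crash


def crash_rate_for_ring(ring_fraction: float) -> float:
    """Stand-in for real telemetry: the observed crash rate among hosts
    updated in this ring. Here it is simply simulated."""
    return random.choice([0.0, 0.0, 0.0, 0.02])  # occasionally simulate a bad update


def staged_rollout(update_id: str) -> bool:
    for ring in ROLLOUT_RINGS:
        print(f"deploying {update_id} to {ring:.1%} of hosts")
        observed = crash_rate_for_ring(ring)
        if observed > MAX_CRASH_RATE:
            print(f"halting rollout: crash rate {observed:.2%} exceeds threshold")
            # trigger rollback for the hosts already updated
            return False
    print("rollout complete")
    return True


if __name__ == "__main__":
    staged_rollout("content-update-2024-07-19")  # invented update identifier
```

The design choice that matters is that each ring acts as a tripwire: a bad update is caught while it sits on a fraction of the fleet, not after it has reached every host.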

Third Party Validation

The report also covers two additional actions:-

· Conduct multiple independent third-party security code reviews.
· Conduct an independent review of end-to-end quality processes, from development through to deployment.

My Final Thoughts

The technical causes of what must be the largest single IT outage in history are mundane, as the causes of 'accidents' so often are.

From what I’m seeing currently, to sum up the root cause in a word:-

Overconfidence.

Overconfidence and faith in the self-written Content Validator that said there was no problem with the update before it was deployed to 8.5 million machines.

This overconfidence has cost CrowdStrike dearly in terms of reputational damage and the inevitable litigation to come. It has cost the world dearly in productivity and financial impact, and, I suspect, once you sum up healthcare disruptions, failures of emergency service lines and so on, likely lives too.

The lack of robust testing of any release that has any technical possibility whatsoever of causing an impact is a clear failing on CrowdStrike's part. This was not some procedural mistake; the update that was pushed out followed CrowdStrike's intended release flow.

Step back and ask of your own processes and procedures, “what’s the worst that could happen?” If all the checks and balances fail, what can go wrong? Then ask: do we do enough to mitigate the worst cases?

You can read the full report here:

https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/


Haydi Foster

Senior Business Analyst & Transformation Leader | Driving Change through Process Improvement & Stakeholder Engagement

7 months ago

Crazy how this simple test step did not happen in this day and age of technology.

Ryan Ashton

AFQY + Smartspace.ai + GOVERNANCE4 ~ Fractional Client Engagement | Community Builder | People & Culture | Technology | MC | Mental Health Advocate

7 months ago

Proofreading your own writing... I can't get that right even with the help of squiggly red lines. Conceptually it makes sense, but what is standard practice across the industry for zero-day vulnerabilities? I look forward to reading your report, which no doubt won't have any red squiggly lines... right? ;-)


Hopefully one good thing that comes out of all this is an increased expectation that vendors prove the suitability of their internal processes via 3rd party audits/certification, especially when the software in question is being allowed into ring 0 (kernel).
