How to prevent your own “CrowdStrike Incident”

While every software company could have its own CrowdStrike-type issue, I firmly believe it doesn’t have to be this way, if…

I’ll get back to that.

First, the recent CrowdStrike incident should be a wake-up call for all software companies that think going fast means skipping vital testing. I commend CrowdStrike for quickly claiming responsibility and publishing its Root Cause Analysis (RCA); see the Channel File 291 RCA Exec Summary. I’m sure they were under tremendous pressure to do so. Regardless, they did not have to be as transparent as they were. It was the honorable thing to do.

Let’s talk about how we “all” got here…

The forces of our industry have been increasing the pressure on software companies to provide updates, enhancements, and features as quickly as possible. Customers are now conditioned to expect constant releases. However, there is a “silent” expectation that all software will be thoroughly tested, including in production-like environments, before it is released to production. The challenge of meeting these expectations grows exponentially as software becomes more complex, innovative, and expansive. And let’s face it, test capabilities could barely keep up 10 years ago, never mind living up to today’s expectations. So, it is no wonder that companies have time-boxed software testing and limited their spending on it. The return on their “test” investment isn’t paying off.

The other phenomenon is the false sense of security companies fall into all the time. Everyone gets comfortable with their software release cadence, whether it is daily, weekly, bi-weekly, monthly, or longer. “Our last n releases went flawlessly, so we must be doing things right.” Well, maybe.

How often does “changed content” get flagged for extended testing or testing in production-like environments outside the release cadence? Have the changes impacted performance or durability in ways that cannot be seen until enough accelerated life or scale has been reached? Would the changes withstand adverse conditions testing? All these test scenarios are time-consuming to create and set up, especially if they have to mimic production environments or accelerated life. Not to mention the people and resource costs.
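As a rough illustration of flagging “changed content” for extended testing, here is a minimal Python sketch. Everything in it is an assumption made for the example: the path prefixes, the line-count threshold, and the `ChangeSet` and `needs_extended_testing` names are hypothetical and would need to reflect your own codebase and risk history.

```python
from dataclasses import dataclass

# Hypothetical rules: which areas count as high risk, and how large a
# change must be before it is treated as risky on size alone.
HIGH_RISK_PREFIXES = ("kernel/", "drivers/", "config/schema/")
LARGE_CHANGE_LINES = 500


@dataclass
class ChangeSet:
    paths: list[str]      # files touched by the change
    lines_changed: int    # total added plus removed lines


def needs_extended_testing(change: ChangeSet) -> list[str]:
    """Return the reasons a change should leave the normal release cadence.

    An empty list means no rule fired and the standard test pass applies.
    """
    reasons = []
    risky = sorted(p for p in change.paths if p.startswith(HIGH_RISK_PREFIXES))
    if risky:
        reasons.append("touches high-risk paths: " + ", ".join(risky))
    if change.lines_changed >= LARGE_CHANGE_LINES:
        reasons.append(f"large change: {change.lines_changed} lines")
    return reasons
```

A release pipeline could call `needs_extended_testing` on every change and route anything with a non-empty result to the longer production-like, scale, or adverse conditions runs.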

“And, aren’t my software suppliers supposed to fully test their software before releasing it to us?” Well, they’re working under the same “Industry” constraints as you, maybe worse.

Additionally, people build up biases and get used to doing the same ol’ thing until the customer finds a major defect. “We have 100% Unit Test coverage” translates to all the checked-in code having unit tests associated with it. It may not mean that every variable or variation has been thoroughly tested. Inevitably, biases build up, and corners get cut, mostly unintentionally. Regardless, some of these missed defects will undoubtedly wreak havoc in unimaginable ways and usually at the worst possible times.
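To make the coverage point concrete, here is a small hypothetical Python sketch (the function and its inputs are invented for illustration). A single test executes every line of the function, so a line-coverage report would show 100%, yet one untested input variation still crashes it.

```python
def average_latency_ms(samples):
    """Return the mean of the latency samples (fails on an empty list)."""
    return sum(samples) / len(samples)


def test_average_latency_typical():
    # This one test runs every line of average_latency_ms,
    # which is enough for a 100% line-coverage report.
    assert average_latency_ms([10, 20, 30]) == 20


def find_failing_variations(cases):
    """Try each input variation and return the ones that raise an error."""
    failures = []
    for case in cases:
        try:
            average_latency_ms(case)
        except ZeroDivisionError:
            failures.append(case)
    return failures
```

Running `find_failing_variations([[], [0], [5.5, 4.5]])` reports the empty list as a failing variation even though the coverage number already looked perfect.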

What can companies do to prevent similar CrowdStrike-type incidents (escapes) from happening?

Below are key steps every software company can implement for better execution during the software test life cycle (STLC).

8 Proactive Measures to Prevent Software Disasters

1. Test Capabilities Assessment:

a. Have an independent, unbiased 3rd party determine if your existing test capabilities are effective and provide coverage for your software’s functional and non-functional requirements.

2. Risk Assessment:

a. Requirements Communications: Determine if there is a single source of truth between Product, Development, and Test

b. Security Risk: Determine if your software and 3rd party supplier software are secure or at risk for known vulnerabilities

3. Improvement Plan:

a. Implement the recommended improvements discovered during the Test Capabilities Assessment

b. Fill in the gaps and eliminate redundancies (phased if necessary) in the Test Automation Framework, Test Suites, and Test Methods

4. Automation:

a. Automate everything possible

b. Implement a continuous testing process

5. Developer Testing:

a. Ensure that Unit Testing is providing 100% code coverage

6. Advanced Test Capabilities:

a. Investigate and implement advanced test strategies that include Adverse Conditions Testing (Chaos Testing), User Scenarios Testing, Accelerated Life Testing, Scale & Performance Testing, and Exhaustive Regression Test suites, which include testing 3rd Party Suppliers and Infrastructure Providers

7. Continuous Assessment & Monitoring:

a. Implement a continuous assessment (AI/ML) process to determine if any changes need extended testing.

8. Robust Deployment Strategy:

a. Implement a deployment strategy that includes health checking, limited rollout, isolated feature enablement, and a rollback strategy for unexpected issues.
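As one possible reading of measure 8, the sketch below shows a limited rollout with health checking and rollback in Python. It is an illustration under assumptions, not a description of any vendor’s deployment system: the `staged_rollout` name, the stage fractions, and the `deploy`, `is_healthy`, and `rollback` callbacks are all hypothetical, and isolated feature enablement is left out for brevity.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stages: Sequence[float] = (0.01, 0.10, 1.0),  # assumed fractions of the fleet
) -> bool:
    """Deploy to growing fractions of hosts, checking health after each stage.

    If any updated host is unhealthy, every updated host is rolled back
    (most recent first) and False is returned. Returns True once the
    final stage passes its health check.
    """
    updated = []
    for fraction in stages:
        # Always update at least one host so the first stage acts as a canary.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(updated):target]:
            deploy(host)
            updated.append(host)
        if not all(is_healthy(h) for h in updated):
            for h in reversed(updated):
                rollback(h)
            return False
    return True
```

The design choice worth noting is that a bad update stops at the stage where it first shows up, so only the hosts updated so far are affected and rolled back, rather than the whole fleet.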

The recent CrowdStrike incident is a stark reminder of the dangers of rushing software releases without thorough testing, and not knowing how effective and comprehensive your test capabilities are. While the industry pressures companies to deliver updates and features at breakneck speed, this often comes at the cost of rigorous testing, leading to catastrophic failures like the one we just witnessed. Companies must reassess their testing strategies, eliminate gaps, and adopt advanced, automated testing methods to avoid similar incidents. Organizations can significantly reduce risk by prioritizing comprehensive testing, continuous testing, and continuous assessments to ensure their software performs reliably under all conditions.

Please be sure to share this article with your colleagues. Feedback Welcome!

I write about improving software quality and operational excellence in the SDLC by making testing seamless, effective, and efficient.

