Ensuring System Reliability through Traditional Testing & Quality Engineering: Lessons from the CrowdStrike Outage

Nikhil Joshi

Empowering Digital Transformation through AI Governance

Published: July 21, 2024

The recent CrowdStrike outage on July 19, 2024, disrupted operations across major industries, revealing critical vulnerabilities in our digital infrastructure. This incident underscores the necessity of traditional testing methods and highlights the shared responsibility of major tech companies, including Microsoft, in maintaining system stability.

CrowdStrike, a leader in cybersecurity, provides the Falcon security agent, which is widely used across industries. On July 19, 2024, a faulty update to the Falcon sensor, one that appears not to have been adequately tested, caused widespread system crashes, reboot loops, and connectivity issues on Windows systems. The impact extended to Windows virtual machines on Microsoft’s Azure cloud platform and on other cloud platforms such as AWS, Google Cloud, IBM Cloud, and Oracle Cloud, which run a significant percentage of their virtual machines on Windows, further exacerbating the situation.

The faulty update affected millions of Windows systems across the globe and raised questions about Microsoft's role in ensuring compatibility and stability with third-party updates. Responsibility is shared among CrowdStrike, Microsoft, and other cloud platforms. The outage disrupted operations in sectors such as airlines, banking, and media, with broader implications for global business operations and consumer trust. Both CrowdStrike and Microsoft faced financial and reputational damage, and businesses worldwide experienced operational challenges and downtime, which underscores the need for robust preventive measures.

Financial and Operational Impact

The outage resulted in significant financial losses and operational disruptions:

IT Impact: Millions of Windows computers (personal computers, workstations, high-end servers, virtual machines, and cloud-based server systems) were affected globally.
People Impact: The outage affected millions of users who rely on Windows-based systems protected by CrowdStrike, not counting the hundreds of millions whose everyday lives were severely disrupted.
Global Impact: The issue had a worldwide impact, affecting multiple countries across different continents.
Impact to Industries: Key sectors including airlines, banking, healthcare, and media experienced significant disruptions.
Financial Impact: Exact figures are not available yet, but the overall financial impact is substantial. CrowdStrike’s share price dropped by 11.1%, closing at $304.96.

Preventive Measures and Lessons Learned

What went wrong and how can this be fixed?

This event will go down in the history of information technology as one of the biggest incidents that could potentially have been avoided with a more conventional approach to software testing. Rigorous testing, better collaboration, and clearer communication could have prevented it. Investing in high quality standards in software development is crucial: the long-term benefits of prioritizing quality over speed far outweigh the costs.

Traditional Testing and Quality Engineering Methods

Traditional testing includes manual testing, regression testing, pre-production testing, and real-world scenario testing. These methods are critical for identifying and mitigating bugs that automated testing might miss. Reliability testing and other conventional methods ensure that updates do not introduce new bugs or conflicts. They also validate that updates do not disrupt existing security measures or adversely affect users. Thorough testing prevents widespread disruptions by validating updates before release.
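As a rough illustration of validating an update before release, the sketch below runs a set of pre-release checks and blocks the release if any of them fails or crashes. It is a minimal sketch, not a description of how CrowdStrike or any other vendor gates releases; the names `CheckResult`, `run_release_gate`, and `can_release`, the `update` record, and both checks are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_release_gate(update: dict, checks: Dict[str, Callable[[dict], bool]]) -> List[CheckResult]:
    """Run every pre-release check against the update and record the outcome."""
    results = []
    for name, check in checks.items():
        try:
            passed = bool(check(update))
            detail = "" if passed else "check returned False"
        except Exception as exc:  # a crashing check counts as a failure
            passed, detail = False, f"check raised {exc!r}"
        results.append(CheckResult(name, passed, detail))
    return results


def can_release(results: List[CheckResult]) -> bool:
    """Release only if at least one check ran and none failed."""
    return bool(results) and all(r.passed for r in results)


# Hypothetical update record and checks, for illustration only.
update = {"version": "2.0.1", "boot_test_passed": False}
checks = {
    "regression: version is set": lambda u: bool(u.get("version")),
    "pre-production: test machines boot after update": lambda u: u["boot_test_passed"],
}

results = run_release_gate(update, checks)
for r in results:
    print(f"{r.name}: {'PASS' if r.passed else 'FAIL'} {r.detail}")
print("release allowed:", can_release(results))
```

The design choice here is that the gate fails closed: a check that raises an exception, or a gate with no checks at all, blocks the release rather than letting it through.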

Neglecting traditional testing and quality engineering contributes to accumulating technical debt, which compromises system stability and security. Addressing technical debt is essential for preventing costly disruptions.

AI and Governance

While it is not clear whether or how CrowdStrike leveraged AI in this instance, reliance on AI-based systems, including automated coding, testing, and decision-making, is beneficial in many respects but may not always yield the best outcomes. The CrowdStrike incident emphasizes the need for robust IT and AI governance to ensure that automated systems are subject to rigorous oversight and traditional testing methodologies. Governing AI systems involves setting clear guidelines, conducting regular audits, and maintaining transparency in AI decision-making processes.

Conclusion

The CrowdStrike outage highlights the importance of traditional testing and quality engineering methods and the shared responsibility of tech giants like Microsoft in maintaining system stability. It is imperative for organizations to invest in robust testing frameworks and prioritize comprehensive quality engineering in their software development lifecycle to ensure reliability and prevent future incidents. Additionally, there should be a stronger focus on governance, particularly around AI, to mitigate risks associated with automated decision-making processes.

Vinita Apte

B2B Marketing Strategy | Transformation Coach | Mental Health Content Creator | Certified Cognitive Hypnotic Coach

2 months ago

Very informative.

Vasudev BV

2 months ago

You bring up an important point about the challenges of regression test prioritization in the context of agility and speed of delivery. It is true that the sheer number of test cases, including regression, integration, and environmental dependencies, can be overwhelming and make it difficult to allocate time effectively. Regression test prioritization is crucial for ensuring that the most critical and high-risk areas of the system are thoroughly tested within the available time frame. By prioritizing test cases based on their impact and risk, organizations can focus their testing efforts on the areas that are most likely to be affected by changes and that have the highest potential impact on the system's functionality. Automation can also play a significant role: by automating repetitive and time-consuming test cases, organizations can free up resources to focus on more critical areas. It is important to strike a balance between agility and thorough regression testing. While speed of delivery is important, it should not come at the expense of quality and risk mitigation. Incorporating regression test prioritization into the development process and leveraging automation tools helps strike that balance.
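The prioritization described in the comment above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an established algorithm from the article: each test carries hypothetical 1 to 5 scores for impact and risk, tests are ranked by the product of the two, and the highest-ranked tests are selected until a time budget is used up. The names `RegressionTest` and `prioritize` and all the sample data are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RegressionTest:
    name: str
    impact: int     # assumed scale: 1 (low) to 5 (high)
    risk: int       # assumed scale: 1 (low) to 5 (high)
    minutes: float  # estimated run time


def prioritize(tests: List[RegressionTest], budget_minutes: float) -> List[RegressionTest]:
    """Pick tests in descending impact * risk order until the time budget is spent."""
    # Ties are broken by shorter run time, then by name, to keep the order stable.
    ranked = sorted(tests, key=lambda t: (-(t.impact * t.risk), t.minutes, t.name))
    selected, used = [], 0.0
    for test in ranked:
        if used + test.minutes <= budget_minutes:
            selected.append(test)
            used += test.minutes
    return selected


# Hypothetical regression suite, for illustration only.
suite = [
    RegressionTest("login", impact=5, risk=4, minutes=10),
    RegressionTest("report_export", impact=2, risk=2, minutes=5),
    RegressionTest("payment", impact=5, risk=5, minutes=20),
    RegressionTest("ui_theme", impact=1, risk=3, minutes=15),
]

for test in prioritize(suite, budget_minutes=35):
    print(test.name, test.impact * test.risk)
```

With a 35-minute budget, the lowest-scoring test is the one left out, which is the trade-off the comment describes: limited time is spent on the areas where a regression would hurt most.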

lalit gupta

art historian \ writer

2 months ago

Insightful!

Dipanshu Mansingka

Principal Consultant / NITI's AIM/ATL Mentor

2 months ago

What I know from 2006 is that there are about 10 levels of code merge and test gateways at Microsoft before a code change reaches the root and then gets propagated back to all the layers. Any change at the lowest level has to be tested by the developer and pass the testing gates. Then, at the next level, all the changes are put together, and all the tests at that level need to be completed before the changes are pushed to the level above.
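The layered gating described in the comment above can be modeled roughly as follows. This is a hypothetical sketch of the general idea only and does not represent Microsoft's actual merge tooling: each level is modeled as a gate function that receives the whole batch of changes and reports whether that level's tests pass, and a batch moves up only while every gate so far has passed. The names `Gate`, `promote`, and `reached_root`, the change identifiers, and the sample gates are assumptions made for the example.

```python
from typing import Callable, List, Sequence

# A gate receives the batch of change identifiers merged at that level
# and returns True only if all tests at that level pass.
Gate = Callable[[Sequence[str]], bool]


def promote(changes: Sequence[str], gates: List[Gate]) -> int:
    """Return how many levels the batch cleared; stop at the first failing gate."""
    for level, gate in enumerate(gates, start=1):
        if not gate(changes):
            return level - 1
    return len(gates)


def reached_root(changes: Sequence[str], gates: List[Gate]) -> bool:
    """A batch reaches the root only if it clears every level."""
    return promote(changes, gates) == len(gates)


# Hypothetical three-level pipeline, for illustration only.
gates: List[Gate] = [
    lambda batch: len(batch) > 0,                         # developer-level tests
    lambda batch: all(c.startswith("fix-") for c in batch),  # team-level integration tests
    lambda batch: len(batch) <= 5,                        # product-level tests
]

print(promote(["fix-101", "fix-102"], gates))
print(reached_root(["fix-101", "feature-7"], gates))
```

The point of the structure is that a failure at any level stops propagation, so a defect has to slip past every gate, not just one, before it reaches the root.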
