CrowdStrike / Microsoft Outage Analysis
Shashank Bajpai
Cybersecurity Evangelist | LinkedIn Community Top Voice | Cloud Security SME | Cyber Risk & Compliance | Speaker, Influencer & Writer
Background:
As a technology professional, I found it imperative to delve deeply into the technical issues behind the global outage of a well-known EDR product on July 19, 2024. Having started my career as a system and driver developer at CDAC and now heading operations and engineering for Cybersecurity at YOTTA, I felt this analysis was crucial to determine the root cause and identify areas for improvement.
Online, numerous discussions emphasize the importance of code-process governance and supply chain risk mitigation. However, despite the technical remedies proposed by the product owner, the root cause that triggered the domino effect and culminated in a global outage has not been precisely identified. I did obtain valuable insights from Zach Vorhies on X (formerly Twitter) and from Praveen Singh on social media. Here, I compile and share this information in two sections: Root Cause Analysis and Areas for Improvement.
Root Cause Analysis:
C++ does not check pointers on the programmer's behalf: code must verify that a pointer is not null before dereferencing it, typically with an explicit null check. In the stack dump from the CrowdStrike crash, the faulting instruction attempts to read memory at address 0x9C, which is 156 in decimal. This suggests that the programmer did not verify the pointer and the code then tried to read a member variable located 0x9C bytes into an object whose base pointer was null. Calculating the address: NULL + 0x9C = 0x9C = 156.
Address 0x9C is not a valid memory region. The critical issue here is that the program in question is a system driver, which runs with privileged access to the system. Because of this elevated privilege, the operating system cannot safely continue after such a fault and must halt immediately to avoid corruption and potential security risks, resulting in a blue screen of death (BSOD). A crash in non-privileged code can usually be contained by terminating the offending application, but a crash in a system driver takes the whole system down. This is why system driver crashes are responsible for the majority of BSODs.
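The address arithmetic above can be reproduced in a few lines of C++. The sketch below is illustrative only: `SensorRecord`, its `flags` field, and both helper functions are hypothetical names, and the layout is chosen so that one member sits at offset 0x9C. It is not CrowdStrike's actual data structure.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: 0x9C bytes of other fields, then the field being read.
struct SensorRecord {
    std::uint8_t header[0x9C];
    std::uint32_t flags;  // sits at offset 0x9C (156 decimal)
};

static_assert(offsetof(SensorRecord, flags) == 0x9C,
              "flags is expected at offset 0x9C");

// Address that would be touched when reading `flags` through a pointer
// whose numeric value is `base`.
std::uintptr_t flags_address(std::uintptr_t base) {
    return base + offsetof(SensorRecord, flags);
}

// Defensive read: validate the pointer before touching any member.
bool read_flags(const SensorRecord* rec, std::uint32_t& out) {
    if (rec == nullptr) {
        return false;  // reject instead of dereferencing a null pointer
    }
    out = rec->flags;
    return true;
}
```

With a null base, `flags_address(0)` evaluates to 0x9C (156), the address seen in the crash, while `read_flags(nullptr, out)` returns false instead of faulting.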
Areas for Improvement:
Technology in Focus:
To address this, Microsoft should enforce stricter policies that roll back a defective driver rather than allowing a risky update to keep loading. CrowdStrike, for its part, should strengthen its code safety practices by integrating code sanitizer tools that can detect such issues automatically.
Furthermore, CrowdStrike may consider rewriting its system driver from C++ in a more modern language such as Rust, whose safe subset has no null references and therefore rules out this type of null pointer dereference through its strong emphasis on memory safety.
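To make the contrast concrete: Rust expresses "this value may be missing" in the type itself and requires the caller to handle the missing case before use. A rough C++17 analogue, using `std::optional`, is sketched below. The `Rule` type and both functions are hypothetical names for illustration only, and unlike Rust, C++ does not force the caller to perform the check.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical rule record, for illustration only.
struct Rule {
    int id;
    std::string pattern;
};

// Returns the matching rule, or std::nullopt when none exists.
// The "may be absent" case is part of the return type, not a null pointer.
std::optional<Rule> find_rule(const std::vector<Rule>& rules, int id) {
    for (const Rule& r : rules) {
        if (r.id == id) {
            return r;
        }
    }
    return std::nullopt;
}

// The caller handles the absent case explicitly and falls back to a default.
std::string pattern_or_default(const std::vector<Rule>& rules, int id) {
    const std::optional<Rule> hit = find_rule(rules, id);
    return hit.has_value() ? hit->pattern : std::string("<none>");
}
```

The design point is that a missing rule produces a well-defined fallback rather than a read through an invalid pointer.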
Process in Focus:
Numerous online articles provide insights on preventing global outages that affect technical ecosystems. While these resources offer various strategies that can be refined and implemented, the key recommendations are:
Diversify Cybersecurity Infrastructure: Mitigate the risks of relying on a single vendor for cybersecurity solutions by diversifying the cybersecurity infrastructure. This strategy enhances resilience and reduces dependency.
Critically Assess SaaS Providers: Evaluate software-as-a-service providers with caution. Thoroughly assess their security, compliance, and reliability measures to ensure they meet rigorous standards.
Prevent Future Crises through Post-Incident Reviews: Conduct comprehensive post-incident reviews to identify root causes and implement corrective actions. This process is essential for preventing future crises and fostering a culture of continuous improvement.
Really insightful Shashank! The article below (put out by NeuShield ) additionally covers the loopholes in the face of a disaster! Please have a look and share inputs if any.. https://www.msn.com/en-us/news/technology/life-interrupted-how-crowdstrikes-patch-failure-is-messing-up-the-world/ar-BB1qhC45
Here is the REAL root cause of the CrowdStrike disaster: Microsoft driver certification bypass. Explained in Spanish: https://lnkd.in/dqXzUKex Technical details in English: https://lnkd.in/dgu9m_Hq