Lessons from the CrowdStrike Incident: Enhancing Organizational Cybersecurity

By Donna Gallaher


In July 2024, the cybersecurity community was shaken by an incident involving CrowdStrike, a leading provider of endpoint protection solutions. A faulty content update pushed to its Falcon agents caused widespread system crashes among its Windows-based clients. This incident is a stark reminder that even the most trusted and sophisticated security solutions can become vectors of disruption. This whitepaper examines the incident and its implications and offers comprehensive recommendations to help organizations strengthen their cybersecurity practices in light of this event.

Introduction

The increasing complexity of modern IT environments has led many organizations to rely heavily on advanced security solutions like those provided by CrowdStrike. These tools, while essential for protecting against a wide array of threats, also introduce new risks when they malfunction. The July 2024 CrowdStrike incident highlighted the potential for widespread disruption when a trusted security provider experiences issues. This whitepaper aims to analyze the incident, extract key lessons, and provide actionable strategies for organizations to improve their overall cybersecurity posture and resilience against similar events.

Root Cause of the Incident

CrowdStrike's Falcon platform is a cloud-native endpoint protection solution whose sensor operates with kernel-level access on client systems. This deep integration enables comprehensive threat detection and prevention, but it also means that any defect in the software can directly affect system stability. On July 19, 2024, CrowdStrike distributed a faulty content configuration update (Channel File 291) to the Falcon sensor on Windows hosts. Early public reports attributed the crashes to a file filled with NULL bytes. CrowdStrike's own root cause analysis instead traced them to a mismatch in which the new content defined 21 input fields while the sensor supplied only 20 values, causing an out-of-bounds memory read. Affected systems crashed with "Blue Screen of Death" (BSoD) errors, and many could not restart normally without manual remediation. Because CrowdStrike is an industry leader with wide adoption in critical sectors, transportation, healthcare, and banking services were interrupted.

The incident underscores the double-edged nature of advanced security solutions and trusted security partners. While they provide crucial protection against a myriad of threats, their privileged position within IT environments means that any malfunction in these highly trusted applications can have far-reaching consequences. It also highlights the challenges associated with rapid, automated update processes, which, while essential for maintaining up-to-date security, can also serve as a vector for distributing problematic code.

Key Lessons and Mitigation Strategies

The CrowdStrike incident offers several crucial lessons for organizations seeking to enhance their cybersecurity practices. The main takeaways and recommendations are:

Strategic Diversification of the Security Stack

Over-reliance on a single provider can create significant risks. Although vendors commonly offer attractive discounts on "bundled services," these savings must be weighed against the risk of concentrating critical services with one vendor that then becomes a single point of failure. At the same time, supporting multiple tools carries an ongoing cost, and reducing that cost is a common reason to consolidate, so companies should be deliberate when deciding where to diversify their technology stack.

Organizations should consider implementing a diverse security ecosystem, potentially using different security vendors for different business units or functions. For instance, one could use CrowdStrike for endpoint protection in one division and alternatives such as Sophos or SentinelOne in others. For critical applications, companies may also want to support more than one operating system in case the primary one fails. In the CrowdStrike incident, only Windows systems were affected; companies that could fail over to macOS or Linux platforms were able to continue operations. This approach ensures that a failure in one security tool doesn't compromise the entire organization. Similarly, using separate password management solutions for regular users and administrators, or separate service providers for internet and cell phone services, supports strategic diversification of the security stack.
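One way to make diversification decisions deliberate is to map critical business functions to the vendors and platforms they depend on, then ask which single provider's failure would disrupt the most. The Python sketch below illustrates this idea. The inventory, the function names `blast_radius` and `single_points_of_failure`, and the 50% threshold are hypothetical illustrations, not a standard method.

```python
from collections import defaultdict

# Hypothetical inventory: each critical business function and the
# providers it depends on. Names are illustrative only.
INVENTORY = {
    "patient_scheduling": {"edr": "VendorA", "os": "Windows"},
    "billing":            {"edr": "VendorA", "os": "Windows"},
    "dispatch":           {"edr": "VendorB", "os": "Linux"},
    "email":              {"edr": "VendorA", "os": "macOS"},
}

def blast_radius(inventory):
    """Map each (dependency type, provider) to the functions that fail with it."""
    impact = defaultdict(set)
    for function, deps in inventory.items():
        for dep_type, provider in deps.items():
            impact[(dep_type, provider)].add(function)
    return impact

def single_points_of_failure(inventory, threshold=0.5):
    """Providers whose failure alone would disrupt more than `threshold`
    (a fraction) of all critical functions."""
    total = len(inventory)
    return {
        key: sorted(funcs)
        for key, funcs in blast_radius(inventory).items()
        if len(funcs) / total > threshold
    }

if __name__ == "__main__":
    for (dep_type, provider), funcs in single_points_of_failure(INVENTORY).items():
        print(f"{dep_type}={provider} affects {len(funcs)}/{len(INVENTORY)}: {funcs}")
```

With this sample inventory, only `edr=VendorA` is flagged, because it supports three of four functions. The Windows dependency sits at exactly 50% and is not flagged, which shows how much the choice of threshold shapes the result.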

Rigorous Vendor Due Diligence

Thorough vendor due diligence is critical, even for industry-leading providers. Small companies may assume that a vendor's security program is stronger than their own, and there is little a small company can do to influence contract terms with an industry giant like Microsoft, CrowdStrike, or Google. Even so, understanding the risk each vendor poses is what shapes the plan for responding when that third party fails or is compromised. Too often, vendors are selected based on features and price alone. Usability and functionality of the security solution are important selection factors, but other criteria may matter more. "Bleeding edge" solutions are often high risk because of an inexperienced leadership team, high turnover in key positions, weak regulatory compliance, an unstable financial position, or other factors that may ultimately affect the availability of the service.

Compliance with regulations and industry standards is a critical factor in vendor due diligence, since regulatory fines and lawsuits can cost as much as, if not more than, a surprise ransomware attack. Whether a vendor pays a criminal a ransom because of a weak security program or pays a regulator a hefty fine because of a weak compliance program, the impact on the bottom line is the same: a large financial loss that may impair the vendor's ability to deliver services.

A vendor's market position provides valuable insight into its capabilities and stability, even if its features or solution design are less attractive than those of competitors. When evaluating potential vendors, organizations should consider the provider's market share and leadership in its field, which can indicate the vendor's ability to innovate and support its products over the long term. However, CrowdStrike is recognized as an industry leader, and its customers were still not prepared for it to fail. Market position alone is not sufficient; other mitigation strategies are required.
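The due diligence criteria above can be combined into a simple weighted scorecard. This Python sketch is illustrative only: the criteria names, weights, 1-5 rating scale, and the `floor` used for red flags are hypothetical assumptions, not an industry standard. It also flags any criterion rated at or below the floor, because a strong total can hide a single disqualifying weakness such as poor regulatory compliance.

```python
# Hypothetical criteria and weights (they sum to 1.0); set your own.
WEIGHTS = {
    "functionality": 0.20,
    "market_position": 0.20,
    "financial_stability": 0.20,
    "regulatory_compliance": 0.25,
    "leadership_stability": 0.15,
}

def weighted_score(ratings):
    """Combine 1-5 ratings for each criterion into one weighted score (1-5)."""
    missing = WEIGHTS.keys() - ratings.keys()
    if missing:
        raise ValueError(f"unrated criteria: {sorted(missing)}")
    return sum(weight * ratings[name] for name, weight in WEIGHTS.items())

def red_flags(ratings, floor=2):
    """Criteria rated at or below `floor`; any one may disqualify a vendor
    regardless of its total score."""
    return sorted(name for name in WEIGHTS if ratings[name] <= floor)

if __name__ == "__main__":
    bleeding_edge = {"functionality": 5, "market_position": 2, "financial_stability": 2,
                     "regulatory_compliance": 2, "leadership_stability": 2}
    established = {"functionality": 3, "market_position": 5, "financial_stability": 4,
                   "regulatory_compliance": 4, "leadership_stability": 4}
    for name, ratings in [("bleeding_edge", bleeding_edge), ("established", established)]:
        print(name, round(weighted_score(ratings), 2), red_flags(ratings))
```

Here the feature-rich but unproven vendor scores 2.6 against 4.0 for the established one and carries four red flags. A high score for an established leader still does not remove the need for contingency plans.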

Contract terms are among the most critical considerations in vendor selection. Vendors rely on existing contracts to forecast revenue and establish company valuation, so they insert termination penalties, automatic price increases, and multi-year terms to retain clients as long as possible while maximizing revenue, and they prioritize contract compliance to keep their client base. Companies can protect themselves by negotiating clauses that allow them to terminate a contract without penalty if the vendor breaches it. To make a breach measurable, the contract should define roles and responsibilities, Service Level Agreements (SLAs) and Key Performance Indicators (KPIs) with periodic performance reports to the client, incident notification procedures, and clear contact instructions for service escalation.
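SLAs are easier to enforce when both parties agree on the arithmetic behind them. The Python sketch below converts an availability percentage into a downtime allowance and checks reported outages against it. It assumes availability is measured over a 30-day period with no excluded maintenance windows. Real contracts often define measurement periods and exclusions differently, and the function names are hypothetical.

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Downtime an availability SLA permits over a period of `days` days."""
    return days * 24 * 60 * (1 - sla_percent / 100)

def sla_breached(sla_percent, outage_minutes, days=30):
    """True if total outage time in the period exceeds the SLA allowance."""
    return sum(outage_minutes) > allowed_downtime_minutes(sla_percent, days)

if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99):
        print(f"{sla}% over 30 days allows {allowed_downtime_minutes(sla):.1f} minutes")
    # A single 6-hour outage against a 99.9% monthly SLA:
    print(sla_breached(99.9, [360]))
```

A 99.9% monthly SLA allows about 43 minutes of downtime, so a single six-hour outage is a clear breach that a well-drafted termination clause could cover.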

Implementation, Configuration, and Management of Security Tools

Effective implementation and management of security tools are crucial when planning for the failure or compromise of a trusted third party like CrowdStrike.

This encompasses several key areas:

1. Patch Management and Versioning: While it's important to keep security tools up to date, organizations should also consider the risks of automatic updates, including allowing computers on a network to update each other automatically when internet service is unavailable. Implementing a phased approach to updates, especially for critical systems, can help mitigate the risk of widespread disruption. This could involve maintaining a subset of systems on slightly older, stable versions as a fallback. Staging should cover every type of update a vendor delivers: in the July 2024 incident, the faulty file was a content update that reached hosts regardless of the sensor version policies customers had configured. A well-thought-out patch management strategy balances the need for the latest security updates with system stability and operational continuity.

2. Network Segmentation: Segmentation plays a vital role in containing the potential impact of security issues. By dividing the network into separate segments, organizations can limit the spread of problems, whether they originate from malicious attacks or malfunctioning security tools.

3. System Configuration: Proper configuration of security tools is essential to ensure they function as intended. This includes setting up appropriate rules, policies, and alerts tailored to your organization's specific needs and risk profile. Standardizing on one or a small number of system configurations saves time when troubleshooting a security incident, since technical teams can begin triage from a known starting point instead of spending valuable time discovering the differences between workstations as they respond to a company-wide incident.

4. Access Controls: Implement the principle of least privilege, restricting access rights for users, processes, and applications to the minimum necessary to perform their functions. Vendors may design their systems to require more access than is necessary or to retain administrative access after the tool has been implemented into a client's production environment. Limiting access rights significantly reduces the attack surface and limits the potential damage from security incidents or compromised vendors.

5. Implementation Strategy: When deploying new security tools, consider a phased rollout approach. This allows for testing and adjustment in a limited environment before full-scale deployment, reducing the risk of widespread issues.

Together, these strategies form a comprehensive approach to implementing and managing security tools, ensuring they provide maximum protection while minimizing potential risks and operational disruptions.
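The phased approach described in items 1 and 5 can be expressed as deployment rings with a health gate between them. This Python sketch is a simplified illustration. The ring names, the `deploy` and `healthy` callbacks, and the zero-failure default are hypothetical, and a real rollout would also wait for a soak period before checking health. The simulated update crashes every Windows host, so the rollout halts at the canary ring instead of reaching the whole fleet.

```python
from dataclasses import dataclass, field

@dataclass
class Ring:
    name: str
    hosts: list = field(default_factory=list)

def staged_rollout(rings, deploy, healthy, max_failure_rate=0.0):
    """Deploy an update ring by ring, halting at the first ring whose
    failure rate exceeds `max_failure_rate`.

    Returns (names of rings that passed, name of the halting ring or None).
    """
    passed = []
    for ring in rings:
        for host in ring.hosts:
            deploy(host)
        failures = sum(1 for host in ring.hosts if not healthy(host))
        if ring.hosts and failures / len(ring.hosts) > max_failure_rate:
            return passed, ring.name  # later rings never receive the update
        passed.append(ring.name)
    return passed, None

if __name__ == "__main__":
    rings = [
        Ring("canary", ["lab-win-01", "lab-linux-01"]),
        Ring("early", ["it-win-01", "it-win-02"]),
        Ring("broad", ["fin-win-01", "ops-win-01", "ops-linux-01"]),
    ]
    crashed = set()

    def deploy(host):
        # Simulated faulty update: every Windows host crashes.
        if "win" in host:
            crashed.add(host)

    def healthy(host):
        return host not in crashed

    print(staged_rollout(rings, deploy, healthy))  # ([], 'canary')
    print(sorted(crashed))  # ['lab-win-01']
```

Only one host is affected in this simulation; the same defect pushed to all hosts at once would have taken down every Windows machine in the fleet.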

Comprehensive Incident Management Planning

Organizations need robust plans for scenarios where critical security services become unavailable – even from trusted vendors. These plans should include clear procedures for detecting service outages, implementing alternative security measures, and communicating with stakeholders. A crucial component of incident management is the ability to quickly identify and implement vendor-provided workarounds. Organizations should establish a process to actively monitor vendor communication channels for updates and workarounds, assigning specific team members to this task and creating systems to quickly evaluate, test, and disseminate workaround instructions to relevant staff.
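Outage detection can start with something as simple as noticing that many endpoints stop reporting at once. This Python sketch assumes you can export each agent's last check-in time from your own management console. The sample data, the 15-minute silence window, and the 20% threshold are hypothetical values to tune, and the sketch only raises a flag for people to investigate.

```python
from datetime import datetime, timedelta

def silent_hosts(last_seen, now, max_silence=timedelta(minutes=15)):
    """Hosts whose agent has not checked in within `max_silence`."""
    return sorted(host for host, seen in last_seen.items() if now - seen > max_silence)

def possible_fleet_outage(last_seen, now, max_silence=timedelta(minutes=15), threshold=0.2):
    """Flag a possible fleet-wide outage, rather than isolated host problems,
    when more than `threshold` of all hosts go silent at the same time."""
    if not last_seen:
        return False
    return len(silent_hosts(last_seen, now, max_silence)) / len(last_seen) > threshold

if __name__ == "__main__":
    now = datetime(2024, 7, 19, 6, 0)
    last_seen = {
        "win-01": now - timedelta(minutes=40),
        "win-02": now - timedelta(minutes=35),
        "mac-01": now - timedelta(minutes=2),
        "linux-01": now - timedelta(minutes=1),
    }
    print(silent_hosts(last_seen, now))           # ['win-01', 'win-02']
    print(possible_fleet_outage(last_seen, now))  # True
```

A single silent laptop is usually a local issue; a large share of the fleet going quiet within the same window is the pattern that should trigger the incident plan.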

As part of incident management planning, establishing alternate communication channels is vital. In today's interconnected business environment, a compromise in communication systems can severely hamper incident response efforts. Organizations should set up backup communication platforms from different vendors. For instance, if Microsoft tools are the primary means of communication, consider setting up accounts on platforms like Slack or Discord as backups. It's important to develop and document procedures for switching to these alternate channels and to regularly test them to ensure all employees can access and use them effectively. This redundancy in communication systems can be crucial during critical incidents.
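Documented switching procedures can include an explicit priority order of channels from different vendors. The Python sketch below is hypothetical: the `Channel` type, the availability checks (stubbed here with lambdas), and the priority order are placeholders for whatever health checks and platforms an organization actually uses. It also verifies that the channel list does not depend on a single vendor.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Channel:
    name: str
    vendor: str
    is_available: Callable[[], bool]  # placeholder health check

def vendor_diverse(channels):
    """True if the channel list spans at least two vendors."""
    return len({ch.vendor for ch in channels}) >= 2

def pick_channel(channels, failed_vendors=frozenset()) -> Optional[Channel]:
    """First channel, in priority order, that is reachable and whose
    vendor is not known to be down."""
    for ch in channels:
        if ch.vendor not in failed_vendors and ch.is_available():
            return ch
    return None

if __name__ == "__main__":
    channels = [
        Channel("Microsoft Teams", "Microsoft", lambda: False),  # primary is down
        Channel("Slack", "Salesforce", lambda: True),
        Channel("Discord", "Discord", lambda: True),
    ]
    assert vendor_diverse(channels)
    print(pick_channel(channels, failed_vendors={"Microsoft"}).name)  # Slack
```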

Cybersecurity is an ever-evolving field, and organizations must continuously fold new strategies and lessons learned into their incident response plans and playbooks. Regular testing of incident response plans through tabletop exercises and simulations is essential, and these exercises should include scenarios in which multiple services or vendors are unavailable simultaneously. Ongoing security awareness training and simulated attacks, such as phishing tests, help maintain a security-conscious culture throughout the organization.
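Multi-vendor scenarios for tabletop exercises can be enumerated systematically so that none are overlooked. This Python sketch lists every combination of up to two services failing at once; the service names and the limit of two are hypothetical. The count grows quickly (n services taken k at a time), so teams will usually choose the most plausible or most damaging combinations to exercise.

```python
from itertools import combinations

def outage_scenarios(services, max_simultaneous=2):
    """All combinations of 1..max_simultaneous services that are down together."""
    return [
        combo
        for k in range(1, max_simultaneous + 1)
        for combo in combinations(services, k)
    ]

if __name__ == "__main__":
    # Hypothetical list of critical services; replace with your own inventory.
    services = ["EDR", "email", "identity provider", "chat"]
    scenarios = outage_scenarios(services)
    print(len(scenarios))  # 10: 4 single outages + 6 pairs
    for combo in scenarios:
        print(" + ".join(combo))
```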

Conclusion

The CrowdStrike incident of July 2024 serves as a valuable lesson in the complexities of modern cybersecurity. It highlights the potential risks associated with advanced security solutions and the need for a multi-layered, resilient approach to cybersecurity. By implementing the strategies outlined in this whitepaper – from diversifying security stacks to improving incident response plans – organizations can enhance their resilience against a wide range of potential security incidents.

As the cybersecurity landscape continues to evolve, so too must our approaches to managing and mitigating risks. The recommendations provided here should not be viewed as a one-time implementation, but rather as part of an ongoing process of evaluation, adaptation, and improvement. By fostering a culture of continuous learning and adaptation, organizations can stay ahead of emerging threats and build robust, resilient cybersecurity practices.

The CrowdStrike incident reminds us that in the realm of cybersecurity, we must always expect the unexpected. By preparing for a wide range of scenarios, including those involving trusted security providers, organizations can better position themselves to respond effectively to whatever challenges may arise. In doing so, they not only protect their own interests but contribute to the overall security and stability of our increasingly interconnected digital ecosystem.

# # #

Diane Nix

Transforming lives through optimal health and wellness coaching.

3 months

Very informative Donna G.!!

Scott Stanton

Cybersecurity Director and BISO at Owens & Minor

3 months

Hi Donna - that's a thoughtful article. a few comments.. 1) The suggestion of controls diversification rings a bit of the duplicative controls anti-pattern (https://www.ncsc.gov.uk/whitepaper/security-architecture-anti-patterns). While not exactly the same as running layered duplicative controls, the intent is the same; controls diversification to avoid single point of failure. 2) The Crowdstrike incident is an inevitability of widely-adopted technology. We've seen many, many situations where widespread outages were caused by "update" over the decades. The only way to adequately manage this "high-impact, low-probability" event is through risk transference (insurance). 3) Contract liability will determine the victims' ability to sue Crowdstrike for damages resulting from this incident. Typically damages are limited to the value of the contract, but savvy (or high-leverage) customers will ask for value multiples or even unlimited damages. 4) Crowdstrike's Patrick McCormack posted a really good de-jargoned explanation of the incident. It's worth a read by anyone who's following the situation, as there's a lot of misinformation out there. https://www.dhirubhai.net/feed/update/urn:li:activity:7224859962806599683/.

Michele Banish

Retired, Founder & CEO of Polaris Sensor Technologies Inc

3 months

Love good work! Thanks for lessons Donna!

Ivens Mendonca

COO | CIO | Operations Director | Technology Director | DevOps | Operations & Technology Executive | Agile Project Management - PMO | Non-Profit Executive

3 months

Thanks Donna. Incidents can come from anywhere, including from those that are meant to prevent (mitigate) them in the first place! Very good insights, thank you.

Greg T.

Founder and CEO Cybersecurity Consulting & Recruitment

3 months

Donna G. The CrowdStrike outage highlights a critical oversight that could have been avoided with proper implementation of a zero-trust model and thorough pre-production testing. By adhering to the principle of "trust but verify," those responsible for the IT systems could have prevented such a widespread disruption. Ensuring updates are rigorously tested in a pre-production environment before deployment is essential for maintaining system integrity and avoiding scenarios like this. Moreover, there are technologies available today, such as Abatis (www.abatis.ch), that make emergency patching a thing of the past. These solutions provide IT managers and CISOs with the time and flexibility to handle updates properly and in an orderly manner, reducing the risk of untested patches causing significant issues. This incident serves as a crucial reminder of the importance of robust security protocols, diligent testing procedures, and the adoption of advanced technologies to ensure system resilience.
