Introduction
On 19 July 2024, a global outage affected Windows servers, virtual machines, and endpoints that used CrowdStrike, a leading endpoint protection and incident response platform. The outage was caused by a faulty Falcon Content Update that triggered a "blue screen of death" or an inoperable system on the affected machines. CrowdStrike has issued a workaround that requires booting each machine into safe mode and recovering manually, which can be challenging for organizations with large or distributed fleets of devices. In addition, organizations using full-disk encryption software must retrieve each machine's recovery key, adding another layer of complexity and risk.
Immediate Actions
IT leaders and security professionals should prioritize ensuring the operational continuity of their PCs, staff, and businesses. According to Gartner, a leading research and advisory company, the immediate actions to take in the first one to seven days after the outage are:
- Alert and engage the incident response and crisis management teams and use appropriate crisis communications to notify employees, clients, and critical third parties of potential disruptions.
- Verify that any information (internal and external) is coming from authoritative sources to avoid the risk of secondary cyberattacks.
- Mobilize predefined crisis management teams for immediate action to prevent user mistakes, including self-service remediation actions from untrusted sources.
- Designate a communications team as a point of contact for internal communication with other stakeholders to minimize disruptions and ensure consistent communication.
- Involve the security operation teams in monitoring for new threat intelligence related to opportunistic attacks, alerts from anomaly detection systems, and other unusual activities.
- Leverage IT technical professionals or delegated IT experts to help PC end users by following the published workaround.
- Use these experts to provide support without granting users direct access to recovery tools or elevated privileges.
Midterm Actions
The next step for IT leaders and security professionals is to assess the impact on secondary systems, look for exposed vulnerabilities, and ensure they have visibility into planned systemwide updates and releases in the coming weeks. According to Gartner, the midterm actions to take in the first one to two weeks after the outage are:
- Establish a triage process to categorize assets and business processes based on the impact of the disruption and the complexity of remediation. Create prioritized remediation plans based on these assets, identify potential side effects and unintended consequences of remediation actions, and identify "straggler" machines that may have the offending driver but were not identified in the first wave of remediation.
- Avoid overreactions, such as an immediate mandate to decommission, disable, or replace CrowdStrike. Instead, defer to the post-incident review process and the existing vendor risk management process to manage this strategic decision.
- Review anomalies or unusual trends with the SOC teams to minimize the risks of an undetected opportunistic attack.
- Participate in the business impact analysis to provide the security viewpoint and ensure balanced discussions about what to do next for potential impacts on the security posture.
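The triage step in the list above can be sketched in a few lines of Python. This is a minimal illustration only, not Gartner's or CrowdStrike's method: the `Asset` fields, the 1-to-5 scoring scales, the host names, and the rule "highest impact first, simplest fix first on ties" are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    business_impact: int         # 1 (low) to 5 (critical); hypothetical scale
    remediation_complexity: int  # 1 (simple) to 5 (hard); hypothetical scale
    remediated: bool = False


def triage(assets):
    """Return unremediated assets, highest impact first, simplest fix first on ties."""
    pending = [a for a in assets if not a.remediated]
    return sorted(pending, key=lambda a: (-a.business_impact, a.remediation_complexity))


def stragglers(inventory, remediated_names):
    """Names in the inventory that no remediation wave has covered yet."""
    return sorted(set(inventory) - set(remediated_names))


# Hypothetical fleet used only to show the ordering.
fleet = [
    Asset("kiosk-07", business_impact=2, remediation_complexity=1),
    Asset("payments-db", business_impact=5, remediation_complexity=4),
    Asset("hr-laptop-12", business_impact=2, remediation_complexity=3),
    Asset("call-center-vm", business_impact=5, remediation_complexity=2),
    Asset("build-agent-3", business_impact=3, remediation_complexity=2, remediated=True),
]

plan = [a.name for a in triage(fleet)]
```

Comparing a full asset inventory against the list of machines already fixed, as `stragglers` does, is one simple way to surface devices that were offline or missed during the first wave. A real process would draw both lists from the organization's own asset management and ticketing data.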
Long-Term Actions
The final stage for IT leaders and security professionals is to mitigate or reduce the risk of the same business impact or exposure caused by the CrowdStrike outage. According to Gartner, the long-term actions to take in the first eight to 12 weeks after the outage are:
- Inform senior leadership across the organization of the status of PCs and the continuing efforts to stabilize the environment and restore trust. Indicate that teams are working on long-term plans to avoid similar disruptions in the future.
- Check agent automatic update settings for your endpoint protection tool. Ensure the settings are consistent with your existing organizational change control policy and the desired state to match your organization’s risk tolerance. Ensure any vulnerability patching is thoroughly tested prior to deployment. As a best practice, stage updates in increments to avoid 100% failure. In addition, check with vendors to ensure all updates honor the staged update policy.
- Actively manage burnout/fatigue in your team because fatigue increases the risk of error. Consider rotating operational staff and, in collaboration with HR, providing resources to alleviate stress.
- Review prevention, response, and support procedures for large-scale outages. Many organizations report being unable to handle the sudden high volume of support requests.
- Check and update downtime procedures for critical operations and revise crisis communication plans, incident response processes, and business continuity management/IT disaster recovery plans accordingly.
- Ensure key employees with response and recovery responsibilities have the necessary competencies and are involved in testing enterprise systems.
- The CrowdStrike outage reinforces the need to focus on resilience. Use a top-down approach that connects resilience efforts to overall strategic objectives.
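The recommendation above to stage updates in increments can be illustrated with a short Python sketch. It does not use any vendor API. The ring names, the cumulative fleet fractions, and the 1% failure threshold are hypothetical values that each organization would set according to its own change control policy and risk tolerance.

```python
import hashlib

# Hypothetical rings, expressed as cumulative fractions of the fleet.
RINGS = [("canary", 0.01), ("early", 0.10), ("broad", 0.50), ("everyone", 1.00)]


def ring_for(host_id: str) -> str:
    """Map a host to a ring deterministically, using a stable hash of its ID."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    for name, upper in RINGS:
        if bucket < upper:
            return name
    return RINGS[-1][0]


def may_advance(failures: int, updated: int, max_failure_rate: float = 0.01) -> bool:
    """Gate: move to the next ring only if the observed failure rate is acceptable."""
    if updated == 0:
        return False  # no evidence yet, so do not widen the rollout
    return failures / updated <= max_failure_rate
```

Hashing the host ID keeps each machine in the same ring across updates, and the gate means a faulty update should stop at the small canary group instead of reaching the whole fleet at once. Whether a given vendor's content updates can be held to rings like these is exactly the question the guidance suggests raising with that vendor.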
Conclusion
The recent CrowdStrike outage has highlighted the importance of resilience against disruptions of all kinds, including those caused by faulty software updates rather than cyberattacks. Organizations need to adopt a top-down approach that aligns their resilience strategy with their overall business objectives and considers the potential impact of different scenarios on their operations, reputation, and stakeholders. To achieve this, organizations should:
- Identify and prioritize the critical systems or organizational areas most vulnerable or essential for their continuity and recovery.
- Develop and test downtime procedures for these critical areas, ensuring that they have adequate backups, alternatives, or workarounds in case of an outage.
- Revise their crisis communication plans, incident response processes, and business continuity management/IT disaster recovery plans to reflect the current threat landscape and best practices.
- Train and empower key employees with response and recovery responsibilities, ensuring they have the necessary competencies, resources, and authority to act swiftly and effectively.
- Coordinate and communicate well between different teams, departments, and external partners involved in the resilience process, fostering a culture of trust, collaboration, and learning.
By following these steps, organizations can enhance their resilience and prepare themselves for future challenges.