Prepare to fail: Lessons from a global outage
Thomas Murray
Global Risk Intelligence | Safeguarding clients and their communities since 1994
The July 2024 global outage of Microsoft Windows systems, caused by a defective software update from CrowdStrike, highlighted the critical importance of robust incident and crisis management plans. The event, which led to widespread disruption across multiple sectors, underscores the need to prepare comprehensively for the failure of critical systems, particularly security monitoring and response tools.
The incident
A defective update from CrowdStrike caused a logic error within the Windows sensor client, resulting in the "blue screen of death" on approximately 8.5 million devices around the world. This disruption impacted sectors from air travel to healthcare, causing thousands of flight cancellations and service interruptions in hospitals.
Key lessons and preparations
Rigorous testing and staging
Ensure all updates, especially those affecting critical infrastructure, undergo extensive testing in controlled environments before deployment. Implement a staged rollout to catch issues early before they affect a wide user base.
Patch and update management
Employ patch management strategies, such as ring models and N-1 policies. Ring models allow updates to be deployed in stages, starting with a small group of users before a wider rollout, thus mitigating the risk of widespread failure.
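As an illustration only, a ring-based rollout can be sketched in a few lines of Python. The names below (Ring, staged_rollout, the deploy and is_healthy callbacks, and the failure-rate thresholds) are hypothetical and do not come from any particular patch management product; a real deployment would rely on the tooling your organisation already uses.

```python
from dataclasses import dataclass


@dataclass
class Ring:
    """One deployment ring (hypothetical structure for illustration)."""
    name: str
    hosts: list[str]
    max_failure_rate: float  # halt the rollout if exceeded in this ring


def staged_rollout(rings, deploy, is_healthy):
    """Deploy ring by ring, halting at the first ring that exceeds its threshold.

    `deploy` and `is_healthy` are caller-supplied callbacks taking a host name.
    Returns (names of completed rings, name of the ring that halted or None).
    """
    completed = []
    for ring in rings:
        failures = 0
        for host in ring.hosts:
            deploy(host)
            if not is_healthy(host):
                failures += 1
        failure_rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if failure_rate > ring.max_failure_rate:
            # Later, larger rings never receive the update.
            return completed, ring.name
        completed.append(ring.name)
    return completed, None
```

The point of the design is that a fault surfaces in the smallest ring first, so the later and larger rings are never touched if an early ring breaches its threshold.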
The N-1 approach involves delaying the installation of updates until they have been proven stable by earlier adopters, reducing the likelihood of introducing critical issues into your environment.
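A minimal sketch of how an N-1 selection rule might look in Python follows. It assumes releases are identified by dotted numeric version strings such as "7.11.0"; the function name select_n_minus_1 and its lag parameter are hypothetical and are used only to illustrate the idea of staying one release behind the newest.

```python
def select_n_minus_1(available_versions, lag=1):
    """Return the release `lag` versions behind the newest one.

    Assumes dotted numeric version strings (an assumption for this sketch).
    Returns None when there are not enough releases to apply the lag.
    """
    ordered = sorted(
        available_versions,
        key=lambda v: tuple(int(part) for part in v.split(".")),
    )
    if len(ordered) <= lag:
        return None
    return ordered[-1 - lag]


# Example with made-up version numbers: the newest release is skipped
# and the one before it is chosen.
chosen = select_n_minus_1(["7.10.0", "7.9.2", "7.11.0"])
```

Setting lag to 2 would give an N-2 policy under the same assumptions.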
Robust backup and recovery plans
Maintain comprehensive backup and recovery plans that include regular testing and validation. Ensure that recovery processes can be executed quickly to minimise downtime.
Incident response readiness
Develop and regularly update incident response plans. Train staff to handle scenarios where security monitoring and response tools are compromised. Simulate incidents to test the effectiveness of these plans.
Develop plans to maintain business operations when key technologies are no longer available. Assume that manual operations will take precedence, and make sure you understand the human resource pressures and constraints that will result.
Regulatory compliance and preparedness
Adhere to regulations such as the Digital Operational Resilience Act (DORA) and the Network and Information Systems Directive (NIS2). Frameworks and regulations like these mandate resilience and response strategies for critical infrastructure and essential services.
Communication strategies
Establish clear communication protocols to inform stakeholders, including customers and regulatory bodies, during an incident. Transparency and timely updates can mitigate reputational damage.
Regulatory insights
DORA requires financial institutions within the EU to establish robust information and communication technology (ICT) risk management, incident reporting, and testing protocols. Compliance with DORA ensures financial entities are better prepared for operational disruptions. The Act requires institutions to develop comprehensive risk management frameworks that encompass:
Regular testing of these frameworks through threat-led penetration testing (TLPT) and red teaming exercises is also mandated to ensure resilience against cyber threats.
DORA also emphasises the importance of having an ICT third-party risk management strategy. Financial entities must ensure that their third-party service providers adhere to similar resilience standards, thus mitigating risks that could arise from outsourced services. This holistic approach ensures that the financial sector maintains operational continuity and security in the face of IT disruptions.
NIS2 extends requirements to a broader range of sectors, emphasising the need for enhanced cyber resilience and incident response capabilities. Organisations under NIS2 must implement measures to prevent and handle incidents, ensuring continuity of essential services. This directive focuses on:
Under NIS2, entities must conduct regular risk assessments and implement appropriate security measures, including incident response and business continuity plans. They are also required to perform periodic audits and testing of their security measures to ensure effectiveness. This includes vulnerability assessments, penetration testing, and continuous monitoring of networks and systems to detect and respond to threats promptly.
Moving forward
The CrowdStrike incident serves as a stark reminder of the vulnerabilities inherent in our reliance on technology. Businesses must adopt a proactive approach to risk management and resilience, ensuring they are prepared to handle failures in critical systems.
By adhering to regulatory requirements and implementing comprehensive crisis management strategies, organisations can better safeguard against similar disruptions in the future.