Managing Risks in Software Upgrades: Lessons from the CrowdStrike Outage
Steve Butler
Technical Programme Manager | IT Transformation Director | Cloud Migration | Data Centre | Application Modernisation | Cost Optimisation - Delivered £30M in Savings | Author | Speaker
Last week’s CrowdStrike outages were hard to miss, even if you were living on a proverbial desert island. These incidents highlight the inherent risks involved in upgrading software, a challenge that's becoming increasingly complex with the rise of cloud services.
The CrowdStrike scenario underscores the critical need for robust service maintenance. A typical software stack, which includes all the components necessary to deliver services to users and clients, comprises various layers. Here, we focus on management tools. These tools, often installed as drivers, sit atop operating systems and have access to the most sensitive parts of the system and hardware.
The notorious Blue Screen of Death (BSOD) is often caused by faulty drivers because they have privileged access to critical server components. While most applications are shielded from causing such catastrophic failures, security-related software is an exception. These components are essential for protecting services from threats like malware, viruses, and hackers. In this case, it was an antivirus/endpoint protection application that led to the issue.
Security applications are updated in two distinct ways:
1. Real-Time Updates: These include new virus signatures and heuristics or rule sets that detect suspicious behaviour. These updates occur frequently, often more than once a day, reflecting the relentless efforts of cybercriminals to compromise data and disrupt services.
2. Periodic Updates: These involve adding new capabilities to the software and occur less frequently. They often require machine reboots, contributing to the risk.
In the recent outage, the update process failed, deploying a defective update that crashed machines upon reboot. While such risks have always existed, good vendor engineering practices typically make these events rare. However, this incident raises questions about whether the patching processes of many affected organisations were sufficiently robust.
While perfect security and resilience are unattainable, practical steps can minimise the risk of large-scale outages:
1. Staggered Updates: Security product updates should first be tested in a dev or test environment. If issues like blue screens occur, the update must be halted before it reaches production. Staggered updates are achievable; they simply require more planning and testing.
2. Effective Restore Mechanisms: Discussions about backups often overlook their sole purpose: enabling the restoration of a server or service. Ensure your restore mechanisms work flawlessly by regularly testing your ability to revert to previous backups or snapshots. This is your ultimate safety net.
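The staggered-update practice described above can be sketched in a few lines of code. This is a minimal, hypothetical illustration (ring names, sizes, and the health check are invented for the example, not taken from any vendor's tooling): deploy ring by ring, and halt the rollout the moment a ring reports unhealthy machines.

```python
# Hypothetical rollout rings: start in a small test environment and widen
# only if the previous ring stays healthy. Names and sizes are illustrative.
RINGS = [
    ("dev", 5),       # test environment first
    ("canary", 50),   # small slice of production
    ("broad", 1000),  # the remainder of the fleet
]


def deploy_to_host(host_id):
    """Stand-in for pushing the update to one machine.

    Returns True if the host stays healthy after the update. Here we
    simulate success; a real check would probe the host after
    installation (heartbeat, successful boot, service status).
    """
    return True


def staggered_rollout(rings, deploy=deploy_to_host):
    """Deploy ring by ring; halt the whole rollout on the first failure."""
    for ring_name, size in rings:
        failures = sum(1 for host in range(size) if not deploy(host))
        if failures > 0:
            return f"halted at ring '{ring_name}': {failures} unhealthy hosts"
    return "rollout complete"


print(staggered_rollout(RINGS))  # rollout complete
```

The design point is that the blast radius of a bad update is capped at the size of the ring it was caught in, rather than the entire estate.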
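The restore-testing practice can likewise be exercised as a routine drill. Below is a minimal sketch (file names and helper functions are invented for the example): back a file up, restore it to a fresh location, and verify the restored copy byte-for-byte against a checksum taken at backup time.

```python
import hashlib
import os
import tempfile


def checksum(path):
    """SHA-256 digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def backup(src, backup_path):
    """Copy src to backup_path and return the digest to verify against later."""
    with open(src, "rb") as f_in, open(backup_path, "wb") as f_out:
        f_out.write(f_in.read())
    return checksum(backup_path)


def restore_and_verify(backup_path, restore_path, expected_digest):
    """Restore from the backup and confirm the copy matches the original."""
    with open(backup_path, "rb") as f_in, open(restore_path, "wb") as f_out:
        f_out.write(f_in.read())
    return checksum(restore_path) == expected_digest


# The drill: the point is not the copy itself but proving, on a schedule,
# that restoration works before you actually need it.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "server-config.dat")
    with open(src, "wb") as f:
        f.write(b"critical service state")
    digest = backup(src, os.path.join(tmp, "backup.dat"))
    ok = restore_and_verify(os.path.join(tmp, "backup.dat"),
                            os.path.join(tmp, "restored.dat"), digest)
    print("restore drill passed:", ok)  # restore drill passed: True
```

A real drill would restore a full server image or snapshot rather than a single file, but the discipline is the same: an untested backup is only a hope, not a safety net.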
These practices apply to in-house as well as cloud hosting, but fundamentally, success hinges on adhering to basic, sound practices.