Managing Risks in Software Upgrades: Lessons from the CrowdStrike Outage

Last week’s CrowdStrike outages were hard to miss, unless you were living on a proverbial desert island. These incidents highlight the inherent risks involved in upgrading software, a challenge that's becoming increasingly complex with the rise of cloud services.

The CrowdStrike scenario underscores the critical need for robust service maintenance. A typical software stack, which includes all the components necessary to deliver services to users and clients, comprises various layers. Here, we focus on management tools. Often installed as drivers, these tools run at the most privileged level of the operating system and have access to its most sensitive parts and to the hardware.


The notorious Blue Screen of Death (BSOD) is often caused by faulty drivers precisely because of that privileged access to critical system components. While most applications are shielded from causing such catastrophic failures, security-related software is an exception: to protect services from threats like malware, viruses, and hackers, it must run at this low level. In this case, it was an antivirus/endpoint protection application that led to the issue.

Security applications are updated in two distinct ways:

1. Real-Time Updates: These include new virus signatures and heuristics or rule sets that detect suspicious behaviour. These updates occur frequently, often more than once a day, reflecting the relentless efforts of cybercriminals to compromise data and disrupt services.

2. Periodic Updates: These involve adding new capabilities to the software and occur less frequently. They often require machine reboots, contributing to the risk.

In the recent outage, the update process failed: a defective update was deployed that crashed machines and left them unable to boot cleanly. While such risks have always existed, good vendor engineering practices typically make these events rare. However, this incident raises questions about whether the patching processes of many affected organisations were sufficiently robust.

While perfect security and resilience are unattainable, practical steps can minimise the risk of large-scale outages:

1. Staggered Updates: Security product updates should first be tested in a dev or test environment. If issues like blue screens occur, the update must be halted to ensure production stability. Staggered updates are achievable; they just take more planning and testing.

2. Effective Restore Mechanisms: Discussions about backups often overlook their sole purpose: enabling the restoration of a server or service. Ensure your restore mechanisms work flawlessly by regularly testing your ability to revert to previous backups or snapshots. This is your ultimate safety net.
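The staggered-update practice above can be sketched as a ring-based rollout: the update advances to the next ring of machines only if every host in the current ring stays healthy. This is a minimal illustration, not any vendor's deployment API; the `Ring`, `staged_rollout`, and health-check names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    """A group of hosts updated together (e.g. dev, test, production)."""
    name: str
    hosts: List[str]


def staged_rollout(rings: List[Ring],
                   apply_update: Callable[[str], None],
                   is_healthy: Callable[[str], bool]) -> List[str]:
    """Apply an update ring by ring, halting before the next ring
    if any host in the current ring is unhealthy afterwards."""
    updated = []
    for ring in rings:
        for host in ring.hosts:
            apply_update(host)
            updated.append(host)
        if not all(is_healthy(h) for h in ring.hosts):
            # A blue screen (or failed health check) in this ring
            # stops the rollout; later rings are never touched.
            break
    return updated
```

If the defective CrowdStrike-style update crashes the test ring, the loop breaks there and production hosts are never updated, which is exactly the stability guarantee the practice aims for.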

These practices apply to in-house as well as cloud hosting, but fundamentally, success hinges on adhering to basic, sound practices.
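The restore-testing practice above can be automated in miniature: record a checksum of every file when the backup is taken, then compare checksums after a trial restore. A restore that cannot be verified should be treated as a failed restore. The `checksum_manifest` and `verify_restore` names below are illustrative, a sketch rather than a production tool.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def checksum_manifest(root: Path) -> Dict[str, str]:
    """Record the SHA-256 of every file under root at backup time."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(restored: Path, manifest: Dict[str, str]) -> List[str]:
    """Return files whose restored contents are missing or do not
    match the manifest; an empty list means the restore test passed."""
    current = checksum_manifest(restored)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )
```

Running this against a scratch restore on a schedule turns "we think our backups work" into a regularly exercised safety net.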
