Avoid Trouble Through Operational Resilience
Chris Farrell
Entrepreneur, Observability Product Leader helping companies optimize their use of technology
Discussing the CrowdStrike-induced Windows meltdown and the culpability of our own operations, Mitch Ashley at DevOps.com writes that “the CrowdStrike incident highlights the importance of adopting more sophisticated deployment strategies.” Ashley argues that DevOps teams, in particular, should have more robust and resilient testing and rollout processes, such as red-black, blue-green, and canary deployments.
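To make the idea of a canary-style rollout concrete, here is a minimal sketch in Python. It is an illustration only, not how CrowdStrike or any particular tool works: the function name canary_rollout, the deploy and is_healthy callbacks, and the stage percentages are all hypothetical. The update goes to a small slice of hosts first, and the rollout halts as soon as a wave fails its health check.

```python
import math
from typing import Callable, List, Sequence, Tuple


def canary_rollout(
    hosts: List[str],
    deploy: Callable[[str], None],      # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],  # hypothetical: post-update health check
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # assumed wave sizes
) -> Tuple[List[str], bool]:
    """Deploy in growing waves; stop at the first wave that is unhealthy.

    Returns the hosts that received the update and whether the rollout
    completed (True) or was halted (False).
    """
    updated: List[str] = []
    for fraction in stages:
        # Cumulative number of hosts that should be updated after this wave.
        target = max(1, math.ceil(len(hosts) * fraction))
        wave = hosts[len(updated):target]
        for host in wave:
            deploy(host)
        updated.extend(wave)
        # Gate: every host in the wave must be healthy before widening.
        if not all(is_healthy(host) for host in wave):
            return updated, False
    return updated, True
```

With a gate like this, a bad update that breaks every machine it touches reaches only the first wave rather than the whole fleet.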
Depending on the situation, DevOps is the first and last line of defense against something as severe as an operating system shutdown. These teams must act as if the next release of anything will create a problem. The OS kernel failing is an aberrant situation, but DevOps is built around the concept of preparing for aberrant situations.
The problem was (is?) that we’ve only really applied that mindset to code updates. It makes sense – a bad piece of code can stress a business quicker than you can say “Git!”
The issue at hand isn’t just that platform update processes need more resilience, though they do. The biggest problems caused by the CrowdStrike kernel crash stemmed not from missing technical resilience but from a lack of business resilience: specifically, the inability to restore an application's state after a crash or outage.
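One common way to make application state restorable after a crash is to checkpoint it to durable storage and reload it on startup. The sketch below is a minimal illustration under stated assumptions: the state fits in a JSON-serializable dict, a local file is an acceptable store, and the names save_checkpoint and load_checkpoint are hypothetical. Writing to a temporary file and then renaming it means a crash mid-write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write state to path atomically (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap of old checkpoint for new
    except BaseException:
        # Clean up the partial temp file; the old checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the last saved state, or default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

On restart, the application would call load_checkpoint before accepting work, so an outage costs at most the changes made since the last save.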
A more sophisticated deployment strategy could have kept application availability higher, but the companies that lacked operational resilience were the biggest losers. So while you should be more cautious with IT deployments, truly safeguarding the business means making the company more resilient across the board.