How A Little Bird Could Have Prevented The CrowdStrike Disaster


Welcome to my LinkedIn newsletter! In each issue of Entrepreneurship and Leadership, I'll be sharing my thoughts on scaling, leading and funding high-growth startups and what the future of innovation looks like. Subscribe here to stay updated.


It’s been called the biggest IT meltdown of all time.

When cybersecurity titan CrowdStrike released a faulty software update in mid-July, the fallout was catastrophic. Worldwide, 8.5 million devices running Windows crashed, wreaking havoc for banks, airlines, retailers, hospitals and many other organizations.

Besides inconveniencing millions of people, the disaster sent CrowdStrike shares tumbling and rattled stock markets. For the companies affected, the damage could reach tens of billions of dollars.

It’s part of a much bigger problem. Poor software quality cost the US economy at least $2.41 trillion in 2022, according to one estimate.

Word is that CrowdStrike failed to detect flawed code before rolling out its update. As the founder of three companies that help developers deliver better, safer software, I know this blunder didn’t have to happen. In fact, the remedy could have been as low-tech as a canary.

Let me explain.

The limits of software testing

CrowdStrike is a sophisticated company with a talented engineering team. It goes without saying that they make testing for bugs a priority. After all, no one writes flawless code. So everything engineers create is rigorously tested before release.

The catch: Inevitably, there are “escaped bugs” that evade testing. It’s important to remember that testing takes place in what is essentially a laboratory environment. Just like something that worked in the lab doesn’t always pan out in the real world, tests can fail to catch errors that cause software to crash.

That’s where a canary could have helped.

In software, “canary deployment” is a powerfully simple concept. It’s named after the bird that coal miners used to carry into the tunnels with them. That canary served as an early warning system for poisonous gases such as carbon monoxide, which would kill it before the miners, giving them a chance to escape.

Canary deployment works the same way, and it has helped many companies avoid huge headaches. In this case, it means that CrowdStrike would have sent the update to far fewer devices at first: say, 8,500 rather than 8.5 million. If it worked on that smaller number of machines, they could then have shared it widely.
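To make the idea concrete, here is a minimal sketch of one common way to pick a canary cohort: hash each device ID into a bucket and release only to the buckets below a cutoff. The function name `in_canary`, the `salt` value, and the device IDs are all hypothetical, and nothing here describes how CrowdStrike or any specific vendor actually selects devices. A fraction of 0.001 matches the 8,500-out-of-8.5-million example.

```python
import hashlib


def in_canary(device_id: str, fraction: float, salt: str = "update-001") -> bool:
    """Return True if this device falls in the canary cohort.

    Hashing (salt + device_id) gives a stable bucket in 0..9999, so the same
    device always gets the same answer for a given update (salt).
    """
    digest = hashlib.sha256(f"{salt}:{device_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < fraction * 10_000


# Hypothetical fleet of 100,000 devices; roughly 0.1% should be selected.
devices = [f"device-{i}" for i in range(100_000)]
canary = [d for d in devices if in_canary(d, 0.001)]
print(f"{len(canary)} of {len(devices)} devices get the update first")
```

Because the choice is deterministic, a device does not flip in and out of the cohort between checks, and changing the salt for the next update picks a different set of early devices.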

In hindsight, that sounds obvious. So why don’t more companies do it?

Canary deployment can be tough to pull off. The challenge lies in systematically pushing out updates to small batches of users, monitoring the results, and then tweaking the code in response. Getting this right requires modern DevOps tools and techniques.

DevOps, short for “development and operations,” is a surprisingly new domain. Essentially, DevOps is software for software developers. That’s vital because the enormous complexity of modern development demands automation and streamlining to keep things moving. But on far too many software teams, the tools that can make that happen get overlooked.

Ideally, a DevOps platform provides a safety net for development teams. Combining practices like limited canary releases, incremental rollouts, and feature flags makes software delivery more reliable.

So your canary didn’t die in the first round? That doesn’t mean full steam ahead. Follow up by sending the update to 100,000 users, then to 250,000, and so on. For engineering teams, it’s all about limiting the blast radius of a bug.
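The incremental rollout described here can be sketched as a loop that widens the audience one stage at a time and stops as soon as a stage looks unhealthy. This is only an illustration under stated assumptions: `staged_rollout`, `measure_failure_rate`, the 0.1% threshold, and the fake health check are all made up for the example, and a real system would read crash or error metrics from its monitoring stack.

```python
from typing import Callable, List, Tuple


def staged_rollout(
    stages: List[int],
    measure_failure_rate: Callable[[int], float],
    max_failure_rate: float = 0.001,
) -> Tuple[bool, int]:
    """Widen the rollout stage by stage.

    Returns (completed, exposed), where `exposed` is the largest number of
    devices that received the update, i.e. the blast radius.
    """
    exposed = 0
    for target in stages:
        exposed = target  # the update now reaches `target` devices
        failure_rate = measure_failure_rate(target)
        if failure_rate > max_failure_rate:
            return False, exposed  # halt: later stages never run
    return True, exposed


def fake_measure(target: int) -> float:
    """Hypothetical health check: the bug only shows up at 250,000 devices."""
    return 0.5 if target >= 250_000 else 0.0


stages = [8_500, 100_000, 250_000, 8_500_000]
completed, exposed = staged_rollout(stages, fake_measure)
print(completed, exposed)
```

In this made-up run the rollout halts at the 250,000 stage, so the final 8.5 million stage is never attempted.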

An effective software rollout also needs feature flags. Think of this software development technique as an on/off switch in case things go wrong. When the inevitable bugs get pushed to production, engineers can instantly switch the affected feature off. Again, this sounds like common sense, but getting it right on a technical level requires sophisticated engineering. The alternative (which is all too common) is that bugs are rolled back manually, with engineers painfully unravelling the impact of bad code on a piecemeal basis.
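As a rough sketch of that on/off switch, the new code path can sit behind a flag that is checked at runtime, so turning the flag off restores the old behavior without shipping a new build. The `FeatureFlags` class, the flag name `new_parser`, and both parse functions are hypothetical. Production systems typically keep flags in a shared service rather than in process memory.

```python
class FeatureFlags:
    """A tiny in-memory flag store (illustrative only)."""

    def __init__(self) -> None:
        self._flags = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str, default: bool = False) -> bool:
        # Unknown flags fall back to the safe default (off).
        return self._flags.get(name, default)


flags = FeatureFlags()


def old_parse(data: str) -> str:
    return data.strip()


def new_parse(data: str) -> str:
    return data.strip().lower()


def parse(data: str) -> str:
    """Route to the new code path only while its flag is on."""
    if flags.is_enabled("new_parser"):
        return new_parse(data)
    return old_parse(data)


flags.set("new_parser", True)
print(parse("  Hello "))   # new code path
flags.set("new_parser", False)
print(parse("  Hello "))   # switched off: old code path again
```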


A pipeline problem

So what's my No. 1 piece of advice to companies I’m talking with? Automate your pipeline.

A pipeline is the process that software developers use to go from planning and writing code to testing, deploying and maintaining it. Sounds easy enough, right? Wrong. Many developers actually skip parts of the process altogether. If the pipeline isn't automated, they can pick and choose which steps they want to follow.

Especially for minor updates, it's human nature to want to move faster. Engineers, in particular, are known to embrace a DIY work ethic and chafe at processes and restrictions. That drive to get code shipped is important and enables rapid progress. But it can also backfire, catastrophically.

The fix: automate the pipeline to make it unskippable, and invest in making sure it speeds up software development rather than becoming a bottleneck.
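One way to picture an unskippable pipeline is a runner that always executes every step in a fixed order and refuses to continue past a failure, with no option to skip anything. The sketch below is an assumption-laden illustration, not a description of any particular CI/CD product: `run_pipeline` and the step names are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run every step in order and stop at the first failure.

    There is deliberately no 'skip' parameter: deployment can only happen
    after every earlier step has returned True.
    """
    completed = []
    for name, step in steps:
        if not step():
            raise RuntimeError(f"pipeline stopped: step '{name}' failed")
        completed.append(name)
    return completed


deployed = []

steps: List[Step] = [
    ("unit tests", lambda: True),
    ("security scan", lambda: False),  # hypothetical failing check
    ("deploy", lambda: deployed.append("v2") is None),
]

try:
    run_pipeline(steps)
except RuntimeError as err:
    print(err)

print(deployed)  # still empty: the deploy step never ran
```

Keeping each check fast matters as much as making it mandatory, since a slow gate is the one people will try to route around.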

When you’re dealing with hundreds or thousands of engineers, this is essential. Ad hoc processes simply can’t be allowed.

Here’s a real-life example. One of our clients is one of the world’s largest banks, Citibank, a highly regulated financial institution with complex software deployment and a ton of security and compliance checks. Self-policing the development process simply isn’t an option at this level.

So to avoid problems, all of the code that their 20,000 developers write goes through an automated pipeline. Those developers are actually happy that the pipeline imposes a rigorous structure because it helps catch errors. At the same time, there’s no way to avoid it.

But for the pipeline to work, it can’t slow developers down too much. It has to provide a system of checks and balances without compromising speed. Otherwise, developers will find ways to go around it.

Everything I’ve talked about comes back to modern DevOps tools. In engineering circles, the open secret is that good tools have historically been lacking. Just like the proverbial cobbler’s children with no shoes, engineers today create the software that runs the world, but they often don’t have great software themselves. Instead, they tend to piece together their own DIY toolkit. The automated, seamless assembly line that we imagine development to be rarely exists.

As we’ve just seen, the consequences can be global in scope. Above all, the CrowdStrike disaster should be a rallying cry for DevOps and for better tools for developers. It’s no exaggeration to say the daily lives of millions, if not billions, of people depend on it.


Thank you for reading! For more insights from my experience as a serial entrepreneur and how we can harness the power of software to change the world, subscribe to Entrepreneurship and Leadership.

Great point, Jyoti Bansal. This really shows how even top companies can overlook small but critical steps. Canary deployment and automation, point taken. Thanks for sharing!

Vidyanandha Gurukulam

School Teacher at School

2 months ago

Upakaram

Neel Shah

Building DevOps Community @ Middleware (YC 23) || Running Observability Meetups || Running @GoogleCloud @CNCF @Docker Communities ||

3 months ago

The CrowdStrike incident is a potent reminder of how important observability is to contemporary DevOps workflows. Teams can find problems before they become more serious, spot abnormalities, and get real-time insights into how well a system is working when it is properly observable. Firms may ensure smoother operations and quicker incident resolution by putting strong observability tools and processes in place to stop small errors from becoming big incidents. Similar to a little bird warning of approaching danger, observability offers the early alerts required to preserve system health and prevent expensive downtime.
