How A Little Bird Could Have Prevented The CrowdStrike Disaster

Jyoti Bansal

Entrepreneur | Dreamer | Builder. Founder at Harness, Traceable, AppDynamics & Unusual Ventures

Published: July 26, 2024

Welcome to my LinkedIn newsletter! In each issue of Entrepreneurship and Leadership, I'll be sharing my thoughts on scaling, leading and funding high-growth startups and what the future of innovation looks like. Subscribe here to stay updated.

It’s been called the biggest IT meltdown of all time.

When cybersecurity titan CrowdStrike released a faulty software update in mid-July, the fallout was catastrophic. Worldwide, 8.5 million devices running Windows crashed, wreaking havoc for banks, airlines, retailers, hospitals and many other organizations.

Besides inconveniencing millions of people, the disaster sent CrowdStrike shares tumbling and rattled stock markets. For the companies affected, the damage could reach tens of billions of dollars.

It’s part of a much bigger problem. Poor software quality cost the US economy at least $2.41 trillion in 2022, according to one estimate.

Word is that CrowdStrike failed to detect flawed code before rolling out its update. As the founder of three companies that help developers deliver better, safer software, I know this blunder didn’t have to happen. In fact, the remedy could have been as low-tech as a canary.

Let me explain.

The limits of software testing

CrowdStrike is a sophisticated company with a talented engineering team. It goes without saying that they make testing for bugs a priority. After all, no one writes flawless code. So everything engineers create is rigorously tested before release.

The catch: Inevitably, there are “escaped bugs” that evade testing. It’s important to remember that testing takes place in what is essentially a laboratory environment. Just as something that works in the lab doesn’t always pan out in the real world, software that passes its tests can still crash in production.

That’s where a canary could have helped.

In software, “canary deployment” is a powerfully simple concept. It’s named after the bird that coal miners used to carry into the tunnels with them. That canary served as an early warning system for poisonous gases such as carbon monoxide, which would kill it before the miners, giving them a chance to escape.

Canary deployment works on the same principle, and it has helped many companies avoid huge headaches. In this case, it means that CrowdStrike would have sent the update to far fewer devices first: say, 8,500 rather than 8.5 million. If it worked on that smaller number of machines, they could then roll it out widely.
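To make the idea concrete, here is a minimal sketch of how a canary cohort might be chosen. It is an illustration only, not a description of how CrowdStrike or any particular vendor selects devices. The function `in_canary`, the device IDs and the fleet size are all hypothetical; the 0.1% fraction simply mirrors the 8,500-out-of-8.5-million figure above.

```python
import hashlib


def in_canary(device_id: str, fraction: float) -> bool:
    """Return True if this device belongs to the canary cohort.

    Hashing the ID maps each device to a stable position in [0, 1),
    so the same devices land in the cohort every time this runs.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < fraction


# Hypothetical fleet of device IDs (an assumption for this sketch).
fleet = [f"device-{n}" for n in range(100_000)]

# 8,500 out of 8.5 million is a 0.1% canary.
canary = [d for d in fleet if in_canary(d, fraction=0.001)]
print(f"{len(canary)} of {len(fleet)} devices get the update first")
```

Hashing rather than picking devices at random on each run keeps the cohort stable, and raising the fraction later only adds devices to the cohort rather than reshuffling it.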

In hindsight, that sounds obvious. So why don’t more companies do it?

Canary deployment can be tough to pull off. The challenge lies in systematically pushing out updates to small batches of users, monitoring the results, and then tweaking code in response. Getting this right requires modern DevOps tools and techniques.

DevOps, short for “development and operations,” is a surprisingly new domain. Essentially, DevOps is software for software developers. That’s vital because the enormous complexity of modern development demands automation and streamlining to keep things moving. But on far too many software teams, the tools that can make that happen get overlooked.

Ideally, a DevOps platform provides a safety net for development teams. Combining practices like limited canary releases, incremental rollouts, and feature flags makes software delivery more reliable.

So your canary didn’t die in the first round? That doesn’t mean full steam ahead. Follow up by sending the update to 100,000 users, then to 250,000, and so on. For engineering teams, it’s all about limiting the blast radius of a bug.
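An incremental rollout of this kind can be sketched as a loop that widens the audience one stage at a time and stops as soon as monitoring reports trouble. This is a simplified illustration under stated assumptions: `staged_rollout`, `deploy`, `healthy` and the crash counts are hypothetical stand-ins, and the stage sizes are taken from the numbers in the text.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    healthy: Callable[[], bool],
) -> int:
    """Widen the rollout stage by stage and halt at the first unhealthy stage.

    Returns how many devices the update had reached when it finished or halted.
    """
    reached = 0
    for target in stages:
        deploy(target)
        reached = target
        if not healthy():
            break  # stop here: the blast radius stays at this stage
    return reached


# Stage sizes follow the article; the crash counts are made up for the demo.
STAGES = [8_500, 100_000, 250_000, 8_500_000]
crash_reports = {8_500: 0, 100_000: 37}

log: List[int] = []


def deploy(target: int) -> None:
    log.append(target)  # a real system would push the update here


def healthy() -> bool:
    return crash_reports.get(log[-1], 0) == 0


reached = staged_rollout(STAGES, deploy, healthy)
print(reached)  # prints 100000
```

In this made-up run the bug only surfaces at the second stage, so the rollout halts at 100,000 devices and never reaches the remaining 8.4 million.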

An effective software rollout also needs feature flags. Think of this software development technique as an on/off switch in case things go wrong. When the inevitable bugs get pushed to production, engineers can instantly switch the offending change off. Again, this sounds like common sense, but getting it right on a technical level requires sophisticated engineering. The alternative (which is all too common) is that bugs are rolled back manually, with engineers painfully unravelling the impact of bad code on a piecemeal basis.
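As a rough illustration of the on/off switch, here is a minimal in-memory feature flag. It is a sketch, not a production design: real flag systems read their state from a shared service so that a change takes effect everywhere at once. The class `FeatureFlags`, the flag name `new_detection_logic` and the `scan` function are all hypothetical.

```python
class FeatureFlags:
    """In-memory flag store, standing in for a shared flag service."""

    def __init__(self) -> None:
        self._flags: dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code stays dark until enabled.
        return self._flags.get(name, False)


flags = FeatureFlags()


def scan(file_name: str) -> str:
    """Route to the new code path only while its flag is on."""
    if flags.is_enabled("new_detection_logic"):
        return f"new scanner checked {file_name}"
    return f"old scanner checked {file_name}"


flags.set("new_detection_logic", True)
print(scan("report.pdf"))  # new code path

# Something went wrong: flip the switch instead of shipping a new build.
flags.set("new_detection_logic", False)
print(scan("report.pdf"))  # back to the old code path
```

The point of the pattern is that the rollback is a configuration change rather than a redeployment, which is what makes it fast.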

A pipeline problem

So what's my No. 1 piece of advice to the companies I talk with? Automate your pipeline.

A pipeline is the process that software developers use to go from planning and writing code to testing, deploying and maintaining it. Sounds easy enough, right? Wrong. Many developers actually skip parts of the process altogether. If the pipeline isn't automated, they can pick and choose which steps they want to follow.

Especially for minor updates, it's human nature to want to move faster. Engineers, in particular, are known to embrace a DIY work ethic and chafe at processes and restrictions. That drive to get code shipped is important and enables rapid progress. But it can also backfire, catastrophically.

The fix: automate the pipeline to make it unskippable, and invest in making sure it speeds up software development rather than becoming a bottleneck.
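One way to picture an unskippable pipeline is a runner that executes every stage in a fixed order and refuses to continue past a failure. The sketch below is a toy model under stated assumptions, not how any specific CI/CD product works; `run_pipeline` and the stage names are hypothetical.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run every stage in order and raise on the first failure.

    There is deliberately no 'skip' option: the only way to ship
    is to pass every stage. Returns the names of the stages that passed.
    """
    passed: List[str] = []
    for name, check in stages:
        if not check():
            raise RuntimeError(f"pipeline stopped: stage '{name}' failed")
        passed.append(name)
    return passed


# Hypothetical stages; each lambda stands in for a real job.
stages: List[Stage] = [
    ("unit tests", lambda: True),
    ("security scan", lambda: True),
    ("canary deploy", lambda: False),  # simulated failure
    ("full rollout", lambda: True),
]

try:
    run_pipeline(stages)
except RuntimeError as err:
    print(err)  # prints: pipeline stopped: stage 'canary deploy' failed
```

Because the order and the stop-on-failure rule live in the runner rather than in each engineer's habits, a minor update gets the same checks as a major one.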

When you’re dealing with hundreds or thousands of engineers, this is essential. Ad hoc processes simply can’t be allowed.

Here’s a real-life example. One of our clients is Citibank, one of the world’s largest banks and a highly regulated financial institution with complex software deployments and a ton of security and compliance checks. Self-policing the development process simply isn’t an option at this level.

So to avoid problems, all of the code that their 20,000 developers write goes through an automated pipeline. Those developers are actually happy that the pipeline imposes a rigorous structure because it helps catch errors. At the same time, there’s no way to avoid it.

But for the pipeline to work, it can’t slow developers down too much. It has to provide a system of checks and balances without compromising speed. Otherwise, developers will find ways to go around it.

Everything I’ve talked about comes back to modern DevOps tools. In engineering circles, the open secret is that good tools have historically been lacking. Just like the proverbial cobbler’s children with no shoes, engineers today create the software that runs the world, but they often don’t have great software themselves. Instead, they tend to piece together their own DIY toolkit. The automated, seamless assembly line that we imagine development to be rarely exists.

As we’ve just seen, the consequences can be global in scope. Above all, the CrowdStrike disaster should be a rallying cry for DevOps and for better tools for developers. It’s no exaggeration to say the daily lives of millions, if not billions, of people depend on it.

Thank you for reading! For more insights from my experience as a serial entrepreneur and how we can harness the power of software to change the world, subscribe to Entrepreneurship and Leadership.

Evan Hecht

2 months ago

Great point, Jyoti Bansal. This really shows how even top companies can overlook small but critical steps. Canary deployment and automation, point taken. Thanks for sharing!

Vidyanandha Gurukulam

School Teacher at School

2 months ago

Upakaram

Neel Shah

Building DevOps Community @ Middleware (YC 23) || Running Observability Meetups || Running @GoogleCloud @CNCF @Docker Communities ||

3 months ago

The CrowdStrike incident is a potent reminder of how important observability is to contemporary DevOps workflows. Teams can find problems before they become more serious, spot abnormalities, and get real-time insights into how well a system is working when it is properly observable. Firms may ensure smoother operations and quicker incident resolution by putting strong observability tools and processes in place to stop small errors from becoming big incidents. Similar to a little bird warning of approaching danger, observability offers the early alerts required to preserve system health and prevent expensive downtime.
