CrowdStrike: how a single corrupt file ...
Photo credit: Unsplash/Milad Fakurian

... brought down entire industries, and what we can learn from it.


Swiss TV "10vor10" asked me for an expert opinion on the CrowdStrike incident and the massive outages around the world. Having answered their questions, I thought I would share my answers with you below. Thanks to Jan Beilicke and Pascal C. Kocher for some good discussions.

Concentration among cyber security software providers: how big is the problem?

Every IT system today needs an endpoint protection solution like the one provided by CrowdStrike. Gartner lists almost 80 providers in this area worldwide, 15 of which are widely used. Although CrowdStrike is clearly dominant, with a market share between 18 and 24%, they are not alone. And no, the Windows outage is not the first time they have messed up this year. They also broke Debian and Rocky Linux.

Rather, the problem is that endpoint security software, by its very nature, is deeply integrated into the operating system (close to the heart) and has many privileges, i.e., it is very powerful. Consequently, if there is a breach or an error in a powerful component near the heart, this can quickly have a negative impact on the entire system.

The question is whether a single corrupt file should be able to bring down an entire operating system. Is this consistent with a resilient architecture, or can we do better? Is such behavior acceptable today? Any third party could have injected a corrupt file. Should the OS prevent this from happening? My personal opinion: yes. We need to rethink computer architecture.
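To make the question concrete, here is a minimal sketch of how a component could check a content file before trusting it and keep the last known good version if the check fails. The file format (a `CNT1` marker, a length field and a SHA-256 checksum) and all function names are assumptions made up for this illustration. They do not describe how CrowdStrike's or any other vendor's files actually work.

```python
import hashlib
import struct

MAGIC = b"CNT1"  # hypothetical 4-byte format marker
# Hypothetical header: magic, payload length, SHA-256 digest of the payload
HEADER = struct.Struct(">4sI32s")


def pack_content(payload: bytes) -> bytes:
    """Build a content file in the hypothetical format."""
    digest = hashlib.sha256(payload).digest()
    return HEADER.pack(MAGIC, len(payload), digest) + payload


def parse_content(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the file is malformed."""
    if len(blob) < HEADER.size:
        raise ValueError("file shorter than header")
    magic, length, digest = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if magic != MAGIC:
        raise ValueError("bad magic")
    if length != len(payload):
        raise ValueError("length mismatch")
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch")
    return payload


def load_with_fallback(new_blob: bytes, last_known_good: bytes) -> bytes:
    """Prefer the new file; keep the previous payload if validation fails."""
    try:
        return parse_content(new_blob)
    except ValueError:
        return last_known_good


if __name__ == "__main__":
    good = pack_content(b"rules-v2")
    print(load_with_fallback(good, b"rules-v1"))       # new file accepted
    print(load_with_fallback(bytes(64), b"rules-v1"))  # corrupt file rejected
```

The point of the sketch is the design choice, not the format: a malformed input leads to a rejected update and a system that keeps running on the old rules, instead of a crash in a privileged component.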

How can such a large-scale breakdown occur?

Several reasons. First, it's the need for security software: no system today can do without securing the endpoints. Almost all connected systems require live monitoring and control, whether as a matter of good practice, regulatory requirement, or simply because of the growing threat landscape.

Second, it's auto-updates, which automatically deploy updates without the operating system, the operator, or the end user being able to see them, to inspect them, to test them. IMHO, this practice urgently needs to be reconsidered. There is a conflict of objectives here between quickly distributing security measures and at the same time checking thoroughly enough what is being distributed. And yes, hardly any CISO would want to delay the installation of security updates. This is similar to smartphones, where we all recommend installing updates quickly. Either way, the stability (and therefore availability) of the infrastructure currently comes with too many single points of failure.
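One common way to ease this conflict of objectives is a staged (canary) rollout: the update first goes to a small fraction of machines and only continues if those machines stay healthy. The following is a minimal sketch under stated assumptions. The function `staged_rollout`, the callbacks `install` and `is_healthy`, and the stage sizes of 1%, 10% and 100% are hypothetical and only illustrate the idea. They are not a description of any vendor's deployment process.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    install: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.1, 1.0),
) -> list[str]:
    """Install on growing fractions of hosts; stop at the first unhealthy batch.

    Returns the hosts that received the update.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[len(updated):target]
        for host in batch:
            install(host)
            updated.append(host)
        if not all(is_healthy(h) for h in batch):
            break  # halt: later stages never receive the update
    return updated


if __name__ == "__main__":
    fleet = [f"host{i}" for i in range(200)]
    # Simulate a faulty update: every updated host reports unhealthy
    hit = staged_rollout(fleet, install=lambda h: None, is_healthy=lambda h: False)
    print(f"{len(hit)} of {len(fleet)} hosts received the faulty update")
```

In this simulation a faulty update stops after the first stage instead of reaching the whole fleet. The price is a delay for the later stages, which is exactly the trade-off between speed and verification described above.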

Third, it's supply chains: lack of clarity, high complexity and a large number of vendors who have a big say in what happens on our devices and what does not. This also needs to be rethought.

According to current knowledge, the reason for the glitch was a failed update, not a cyber attack. Does today's error also show the potential impact of a targeted attack on a large scale?

Even if the CEO of CrowdStrike says it was a faulty update in a single file, we must not be naive here. Where the error ultimately came from, whether an attacker had access to the development process, or whether the corrupt file still has an ulterior motive, can hardly be answered conclusively after such a short time.

Talking about the impact, we have seen it time and again in the past: if backdoors are carefully placed, it is almost impossible to detect them. If detection is successful, then there is often more luck involved than we would like to admit. And yes, it is a foretaste of what may come. Supply chains in the digital world have long been critical and are not monitored enough. Too many providers, high complexity and very high potential for damage. This makes it extremely attractive for criminals to carry out their malicious intentions.

What challenges do you see in the area of cyber security in the future?

If I boil it down to four things, these would be: First, promoting heterogeneity at the cost of efficiency, but for the benefit of resilience. If an airline or a hospital or a bank relies on a single tech stack, then what we saw on Friday is much more likely to happen than if they use a variety of different solutions. macOS and Linux, for example, were not affected, at least not this time. Although this measure is more complex and expensive, it should be considered more often for the sake of resilience.

Second, we should increase testing capacities. More security testing needs to be carried out on products and devices. This applies in particular to products that are not subject to regulation and are used on a large scale. I see a huge need to catch up here. Politicians are called upon to act. Btw, legal frameworks such as the Cyber Resilience Act (CRA) require testing for products. Fridays and holidays are no exception, and testing must not happen under pressure.

Third, every organization must also have a proper business continuity management concept (BCM), i.e. design and practice measures to avoid business interruptions. Analog processes with handwritten boarding passes are part of this reality. And they need to work right before the weekend and during the holidays.

Fourth, we should reduce dependencies. Software development today is more a matter of plugging together existing components, classes, libraries and snippets. Only a few percent of that code is actually used, but the rest still comes along as unused and potentially vulnerable code that can be exploited by cyber criminals. We believe we are doing this for reasons of efficiency and cost. But it's the wrong approach. A reduction to the core functionality would be right.


For German speakers, here is the link to the SRF (Schweizer Radio und Fernsehen) interview with Arthur Honegger, where the above is heavily shortened: 10vor10. Enjoy!


Euan Dykes

Fronting the ship of innovation

8 months ago

I learnt that other OSs are also vulnerable to this type of error. It's not uniquely a Microsoft concern. And the blue screen is a stop-loss measure, so a system can safely recover without further damage. It's one thing to have to manually reboot PCs, it's another thing if data is lost and hardware needs replacing.

David Gugelmann

Founder, Chairman and Co-CEO at Exeon Analytics, Dr. sc. ETH

8 months ago

Well said!

Markus Thomi

Senior Software Engineer Team Lead. Industrial automation.

8 months ago

"Even if the CEO of CrowdStrike says it was a faulty update in a single file, we must not be naive here. Where the error ultimately came from, whether an attacker had access to the development process, or whether the corrupt file still has an ulterior motive, can hardly be answered conclusively after such a short time." The following link supports your statement and just shows that in our complex software landscape it is hard if not impossible to have everything under control: https://trufflesecurity.com/blog/anyone-can-access-deleted-and-private-repo-data-github

Beat Bischof

Senior Business Consultant; Senior Business Analyst, Digital Strategy, Digital Transformation, Using Artificial Intelligence Intelligently, Cloud, Business Development, Ad Interim Management

8 months ago

Very interesting article. I only disagree about politicians taking care of security. This would not lead to problem solving but to chaos. IT has to do this.


More articles by Raphael Reischuk
