CrowdStrike or the resilience conundrum
We all woke up this morning to news of a major computer outage all over the world affecting large businesses such as airlines, hospitals, railway companies, telecom operators, and, according to NBC News, some 911 emergency services in the US.
Game designer Robin D. Laws just posted this on Mastodon, which I find funny because there's a strong kernel of truth to it:
"This is how the world ends, not with a bang, but with a Windows 365 cycling reboot error."
What's going on seems to be a little more complex (and hopefully a little less dire) than that. From what I understand, the issue stems from a CrowdStrike update pushed out last night that affects the Windows OS, effectively taking out of service computers on networks that use CrowdStrike for their security until either the solution is disabled or the fix is applied. Apparently this needs to be done on a computer-by-computer basis.
The global impact of this update highlights once again a conundrum of modern network security: as attacks become more and more sophisticated, it is virtually impossible for businesses to protect themselves without relying on a small number of security software providers such as CrowdStrike. Paradoxically, this creates a single point of failure of sorts, since a huge proportion of businesses worldwide use the same solutions. If these solutions fail, a large proportion of businesses go down, and cascading effects hit others through interdependencies (a number of unaffected airports are closing down because the airlines or other airports they depend on are affected).
At a time when the European Commission puts a strong emphasis on resilience and cybersecurity through things like Articles 40 and 41 of the EECC as well as directives such as NIS2, this raises interesting and complicated questions about the role of policy in this field. A lot of these EC mandates have yet to be implemented by national regulators, and we at Plum Consulting have assisted regulators in figuring out how to implement these necessary changes.
Today's incidents, however, raise a real question about the role and limitations of policy when it comes to resilience and cybersecurity: if imposing measures on critical businesses increases their dependency on the same few providers to meet the obligations placed upon them, this in itself creates a new, systemic risk of failure should those providers themselves crash or fail.
This is, in another guise, the tension between centralised and highly protected architectures, which rarely fail, but have massive impacts when they do, and decentralised but potentially weaker architectures, which may individually have higher risks of failure, but much lower impact when they fail.
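To make that trade-off a little more concrete, here is a toy simulation of mine (not based on any real data; the firm counts and failure probabilities are made up purely for illustration). Both set-ups are tuned to have the same expected number of firms knocked out per year; the difference lies entirely in how correlated the failures are.

# Toy model: one shared vendor vs several independent vendors.
# All numbers below are invented for illustration only.
import random

random.seed(42)

N_FIRMS = 1000      # hypothetical number of dependent businesses
YEARS = 100_000     # simulated years

def simulate(n_vendors: int, p_fail: float) -> list[int]:
    """For each simulated year, count how many firms lose service.
    Firms are split evenly across vendors; a vendor failure takes down
    every firm depending on it (fully correlated when n_vendors == 1)."""
    firms_per_vendor = N_FIRMS // n_vendors
    impacts = []
    for _ in range(YEARS):
        down = sum(firms_per_vendor
                   for _ in range(n_vendors)
                   if random.random() < p_fail)
        impacts.append(down)
    return impacts

single = simulate(n_vendors=1, p_fail=0.01)   # one vendor, 1% yearly failure rate
multi = simulate(n_vendors=5, p_fail=0.01)    # five vendors, 1% each

for name, data in [("single vendor", single), ("five vendors", multi)]:
    mean = sum(data) / len(data)
    print(f"{name}: mean firms down per year = {mean:.1f}, worst year = {max(data)}")

The average impact comes out the same in both cases, but with a single shared vendor the bad years take everyone down at once, which is exactly the systemic, CrowdStrike-style scenario; spreading the same failure rate across several vendors caps how much can go down in any one incident.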
There may be a risk that policy interventions designed to enhance security through standardised requirements lead to reliance on just a small number of vendors with the capability to meet those standards at scale. Paradoxically, this could produce overreliance on systems which are themselves capable of failure. Perhaps some thought needs to go into different, less centralised models that would mitigate these systemic risks a little better?
Computer Performance Modeling and Analysis
Auto update should not be considered a best practice, except perhaps for virus definition updates. It is neither good for security nor reliability. If you don't have an integration team, then wait a day, having listened for screams.
Most telecom operators until the late 1990s had a multi-vendor strategy, with multiple suppliers for every key system (switches, transmission systems), including software strategies with multiple versions. This protects against CrowdStrike-like incidents. Organisations contracting with two different security vendors and installing half their Windows systems in an airport terminal hall on Vendor A and the other half on Vendor B would keep services partially running. But it does require highly skilled in-house technical staff, the ability to define crisp and clear interfaces between vendors, conformance and interoperability testers, and more operations staff:
* Software soaking
* A- and B-sides, with immediate switch-back / roll-back capabilities
* Only one side upgraded first, the other only after an observed period of faultless operation
* Roll-out first in small operational field sites, then mid-size and ultimately large systems
It is doable. It is costly. The deeper issue, alas, is the attractiveness for many executives of outsourcing "responsibility" to single vendors and economizing on their own IT staff.
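As a rough sketch of the staged roll-out discipline described in the comment above (soaking, A/B sides, switch-back, progressively larger sites), the control loop might look something like the following. The stage names, sizes and the deploy / check_health / rollback helpers are hypothetical placeholders, not any real vendor's tooling.

# Hypothetical staged roll-out with soak periods and roll-back,
# sketching the practice described above; all names are invented.
import time

STAGES = [
    ("small field sites", 5),
    ("mid-size sites", 50),
    ("large sites, A-side only", 500),
    ("B-side, after faultless A-side operation", 500),
]

SOAK_SECONDS = 24 * 3600   # observe each stage for a day before promoting

def deploy(stage: str, hosts: int) -> None:
    """Placeholder: push the update to the hosts in this stage."""
    ...

def check_health(stage: str) -> bool:
    """Placeholder: query monitoring for boot loops, crash dumps, alarms."""
    return True

def rollback(stage: str) -> None:
    """Placeholder: switch the stage back to the previous known-good version."""
    ...

def staged_rollout() -> None:
    for stage, hosts in STAGES:
        deploy(stage, hosts)
        time.sleep(SOAK_SECONDS)          # software soaking period
        if not check_health(stage):
            rollback(stage)               # immediate switch-back capability
            raise RuntimeError(f"roll-out halted at stage: {stage}")

None of this is technically exotic; as the comment notes, the real cost lies in the skilled in-house staff and operational discipline needed to run it.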
Thank you for this post, dear cousin! I'm currently sitting in Chicago O'Hare airport, not sure yet when I will board and take off from Chicago… Thank you CrowdStrike!
Thanks Benoît. This resonates across the resilience projects I've led for ICT companies in Africa and Asia. Diversity and preparation across a system = resilience. More can be done on a regular basis to conduct systemic audits. This can remove or mitigate many of the risks that emerge from the combination of layers (hardware, software, process, people, and regulation) upon which we depend.