First CrowdStrike, then Azure: how do you protect yourself when the infrastructure is falling to pieces?
The reports of a DDoS (distributed denial of service) attack and the subsequent degradation of the Azure platform between 11:45 UTC and 19:43 UTC on 30 July 2024 did not go unnoticed. Given the size of the Microsoft estate, and therefore the target it presents, we should be glad - in one sense - that they are at least competent at recovering from a large event such as this.
However, and perhaps more tellingly, the Microsoft statement released after the event said “initial investigations suggest that an error of our defences amplified the impact of the attack rather than mitigating it”.
We have had, in rapid succession, two events on the Microsoft platform, and although one of them was not caused by Microsoft, the result of both has been large-scale disruption to elements of the Microsoft platform because of software irregularities and/or incompatibilities. The first was the CrowdStrike update that caused Windows machines to crash; the second was Microsoft's own defence system, which mistakenly amplified the impact of the denial of service attack.
Let's not lose perspective here. Both problems were mitigated quickly, particularly given the size of the affected estate. Both organisations responded quickly and explained the problem - clearly, if not acceptably. Yet here we are, many of us consumers of Microsoft, wondering whether the next outage, however it might be caused, might sink our business despite all the efforts we are making to keep ourselves secure - safe in the knowledge that, should this happen, our contractual recourse against Microsoft is frankly zero.
What are our options? We can always move to AWS, Google or the others. But that's a hassle. Most of us are not so dependent on the Internet that we absolutely need 100% uptime. But it would be nice to know that the products deployed, the people employed and the processes intended to assure quality are all working as they should.
There is a bigger issue at stake, and one which the regulators are getting increasingly nervous about, particularly in financial services at the European level. Concentration of cyber risk in a small number of third-party suppliers is a big concern for the drafters of the EU's DORA (Digital Operational Resilience Act) rules that come into force in January 2025, for example. And the UK's forthcoming Cyber Security and Resilience Bill will also seek to address this issue. If a major cloud provider goes down, how much of our national infrastructure will fail as a result? Too much.
We talked in previous posts about the need for diversity and defence in depth. In the minds of many, this is expensive. But the calculus is simple, and as CrowdStrike and Microsoft have shown, we don't need the Russians or the North Koreans: a fat-finger error is enough, and the show is over - unless we realise that resilience means diversity of supply, not only of operational technology but also the ability to use different IT and communications infrastructure should one fail.
The IT megaliths we rely on are so big that their failure is as unimaginable as it is terrifying. Their complexity is staggering: the Microsoft DDoS attack affected Azure App Services, Application Insights, Azure IoT Central, Azure log search alerts, Azure Policy, the Azure portal, Azure Front Door and Azure Content Delivery Network (I am by no means an expert on Azure, but that is a long list of stuff). Yet we have to live with them in a way which at least gives us some hope that, if the worst were to happen, we had a backup plan that would allow us to switch from one to another. It should be possible to have relationships with more than one infrastructure provider without suffering any loss of service. And it should be incumbent on the infrastructure suppliers to ensure a high degree of interoperability between them, so that in the event of a failure in one, the switch over to another is almost seamless. We cannot have a failure in one leading to the failure of many, particularly when the failure may be self-inflicted.
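To make that idea of "almost seamless" switching a little more concrete, here is a minimal sketch of the principle at the application level: probe health endpoints for a primary and a standby provider and route to whichever responds first. The provider names and URLs are hypothetical placeholders rather than real Azure or AWS endpoints, and a production failover would of course also involve DNS, traffic management and data replication.

```python
import urllib.request
from typing import Optional

# Hypothetical health-check endpoints for a primary and a standby provider.
# The URLs are placeholders for illustration only, not real services.
PROVIDERS = [
    ("primary-cloud", "https://status.primary.example.com/health"),
    ("standby-cloud", "https://status.standby.example.com/health"),
]


def first_healthy_provider(timeout_seconds: float = 3.0) -> Optional[str]:
    """Return the name of the first provider whose health endpoint answers HTTP 200."""
    for name, url in PROVIDERS:
        try:
            with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                if response.status == 200:
                    return name
        except OSError:
            # Treat timeouts and connection errors as "unhealthy" and try the next provider.
            continue
    return None


if __name__ == "__main__":
    active = first_healthy_provider()
    if active is None:
        print("No provider is responding: invoke the business continuity plan.")
    else:
        print(f"Routing traffic to {active}.")
```

In practice the switch would be handled by DNS, load balancers or traffic managers rather than a script, but the principle is the same: know in advance where you will fail over to, and test it.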
Is it all doom? No. These incidents show the importance of well-thought-through, communicated and tested Business Continuity and Incident Response Plans. Through specific scenario planning, organisations can plot a course to deal with outages of this magnitude, however unlikely they might seem. Seven hours and 58 minutes may not feel very long - but when you are a trading organisation for whom any downtime is unacceptable, or when you have a Return to Operations target of less than 8 hours, this could have been very serious.
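For a sense of how thin that margin was, here is a small illustrative calculation comparing the reported outage window with an eight-hour Return to Operations target. The eight-hour figure is an assumption taken from the example above, not a Microsoft commitment.

```python
from datetime import datetime, timedelta, timezone

# Outage window reported for the 30 July 2024 Azure incident (times in UTC).
outage_start = datetime(2024, 7, 30, 11, 45, tzinfo=timezone.utc)
outage_end = datetime(2024, 7, 30, 19, 43, tzinfo=timezone.utc)

# Assumed Return to Operations target of eight hours, per the example in the text.
rto_target = timedelta(hours=8)

outage_duration = outage_end - outage_start
headroom = rto_target - outage_duration

print(f"Outage lasted {outage_duration} against an RTO target of {rto_target}.")
if headroom > timedelta(0):
    print(f"Only {headroom} of headroom remained before the target was breached.")
else:
    print("The RTO target was breached.")
```

Two minutes of headroom against an eight-hour target is not a margin any trading organisation would want to rely on.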
Please get in touch if you would like to discuss this or any aspect of contingency and scenario planning, business continuity or incident response. #resilienceandrecovery #astaaracyber