Operational Risk & Preventing The Next Global IT Outage
Mark Thompson
Institutional business development at the intersection of technology and financial services
After some initial hoo-ha about it being some sort of cyber-attack, it quickly became clear that the massive IT outage that caused such disruption on Friday 19th July was just the latest example of that age-old IT problem: an update going wrong. Some likened it to a horror movie where “the call came from inside the house!”
Crowdstrike, the ultimate source of the issue, must bear the majority of the blame for the disruption it caused – both because of mistakes in its Operational Risk Management protocols and because it is the main, and some would say only, provider of endpoint and end-user security at scale. It needs to make considerable improvements to its internal protocols.
However, there are other lessons that each business can learn individually and take forward to help us all improve our resilience against such issues in the future.
What actually happened on Friday 19th?
Within the Crowdstrike Falcon software there is a security module designed to locally manage the security of each “endpoint” and mitigate the risk of that endpoint being compromised by unethical means. As the “market” for unethical means is constantly changing and growing, regular updates are needed to protect against the latest threats. 99.9% of the time you won’t notice these patches going in, but this was that rare 0.1% of occasions when something went wrong.
And it really went wrong.
It was a patch for machines running a Windows Operating System (OS) that caused the seizure of so many computers. Additionally, due to the nature of the solution the Crowdstrike software is designed to provide, it has many tentacles that spread into all corners of the operating system – an “octopus” of the software world. Any patch to such software will, by its nature, integrate itself into every corner of the operating system, and that is precisely why end users had such difficulty resolving the problems the update caused: the changes were so deeply ingrained in the operating system.
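To make that failure mode a little more concrete, here is a minimal sketch in Python. It is purely illustrative – the file format, field names and functions are hypothetical inventions of mine, and the real Falcon sensor is a kernel-mode component rather than a Python script – but it shows the difference between a loader that trusts a content file blindly (and falls over when the file is malformed, which at kernel level means a blue screen) and one that validates the file and fails closed.

```python
# Purely illustrative sketch with a hypothetical file format (magic bytes,
# record count, fixed-size records) -- not CrowdStrike's real code. The point
# is the difference between trusting an update file blindly and validating it.
import struct

HEADER = struct.Struct("<4sI")   # hypothetical: 4-byte magic + record count
RECORD = struct.Struct("<QQ")    # hypothetical: two 64-bit fields per record

def load_update_unsafely(blob: bytes) -> list[tuple[int, int]]:
    # Trusts the declared record count; a malformed file makes the reads run
    # off the end of the buffer and the loader blows up (a struct.error here,
    # a crashed machine when the equivalent happens deep inside the OS).
    _magic, count = HEADER.unpack_from(blob, 0)
    return [RECORD.unpack_from(blob, HEADER.size + i * RECORD.size) for i in range(count)]

def load_update_safely(blob: bytes) -> list[tuple[int, int]]:
    # Validates the file before use and fails closed on anything unexpected,
    # so a bad content file is rejected instead of taking its host down.
    if len(blob) < HEADER.size:
        raise ValueError("update file too short")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != b"UPDT":
        raise ValueError("unexpected magic bytes")
    if len(blob) != HEADER.size + count * RECORD.size:
        raise ValueError("record count does not match file size")
    return [RECORD.unpack_from(blob, HEADER.size + i * RECORD.size) for i in range(count)]
```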
What is an endpoint?
An endpoint can be anything from an employee’s laptop on the company network to a card payment terminal, so the range of devices that were affected and effectively crashed was very wide, touching almost every interaction in our average, everyday lives. And this is why so much was affected, including stock exchanges, airlines, payment terminals and so on.
But is there a wider assignment of responsibility for managing this risk that we could be missing? Within the tenets of Operational Risk Management, a topic I will refer to a number of times, there are, or at least should be, ways of guarding against the broader effects of such an issue, particularly the fact that it caused systems to grind to a halt rather than just slowing them down. And by systems here, I mean the processes by which businesses function and that we rely on technology to make more efficient. I feel that is a very important distinction to make – the separation between systems and processes on the one hand and the technology we use to implement them on the other.
Some of the issues faced are systemic in nature. Crowdstrike and Microsoft – the unholy alliance whose combination caused the issue – are market leaders, by some considerable margin, in the provision of endpoint technology and the security thereof. They exhibit such market dominance that each on its own could be considered a monopoly, and together the “union” of the market share they control is huge. Unfortunately, this means there aren’t many, if any, other options for companies to choose from to manage this part of their IT infrastructure, so when the update failed, systems went down across the majority of industries. More competition in the provision of endpoint security would make additional redundancy possible and practical for a company to implement.
How to think about managing the risk of system issues caused by IT failures.
Operational risk management is the management, or mitigation, of risk associated with people, processes or systems. The first two are self-explanatory, but the final one includes the technical aspects: computers and other machines.
The Crowdstrike outage was a failure across all of these factors. Crowdstrike will have had protocols, tests and checks in place to prevent issues like the one that happened. After all, as previously mentioned, 99.9% of the time you won’t notice the updates sent to your devices, so most of the time these control points work. But every process is ultimately fallible, especially if there is a human involved. And without going too deep into philosophy, there is no “technical” system – whether AI, a computer program or something hardware-based – in which a human being is not involved at some stage.
Moving on, we return to the parts of the operational risk management process that failed here. Ultimately, a human being designed and coded the required security patch, and another human tested it according to Crowdstrike’s established control protocols and assessed that it carried no risk to roll out – an assessment that could have been wrong because of a software or hardware issue in the testing environment, or because the protocol was designed without this particular error in mind.
Then another human being signed off on the patch being released – but were all checks performed fully and properly? Could the issue have been something to do with the downloader software, or with the software that integrates the patch into the existing software “octopus”?
Finally, whilst it might not have been the issue here, there could be some setting local to the client endpoint computers that caused their systems to crash. All of these are considerations that should be taken into account when determining whether a software update is viable or not.
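One control that speaks directly to those sign-off questions is a staged, or “canary”, rollout: release to a small ring of machines first, watch for crashes, and only then widen the release. Below is a minimal sketch of such a gate – the ring names, soak time and crash-rate threshold are hypothetical, and I am not suggesting this is how Crowdstrike’s own pipeline works.

```python
# Minimal sketch of a staged ("canary") release gate. Ring names, soak time
# and the crash-rate threshold are hypothetical, chosen for illustration.
import time
from typing import Callable

ROLLOUT_RINGS = ["internal_lab", "canary_1pct", "early_adopters_10pct", "general_availability"]
MAX_CRASH_RATE = 0.001   # abort if more than 0.1% of ring endpoints crash
SOAK_SECONDS = 4 * 3600  # let each ring run for a few hours before widening

def staged_rollout(release: str,
                   deploy_to_ring: Callable[[str, str], None],
                   crash_rate_for_ring: Callable[[str], float],
                   halt_release: Callable[[str, str], None]) -> bool:
    """Deploy ring by ring; stop and pull the release at the first sign of trouble."""
    for ring in ROLLOUT_RINGS:
        deploy_to_ring(release, ring)
        time.sleep(SOAK_SECONDS)                # soak period: give faults time to surface
        observed = crash_rate_for_ring(ring)
        if observed > MAX_CRASH_RATE:
            halt_release(release, f"crash rate {observed:.3%} in ring '{ring}'")
            return False                        # the wider fleet never receives the release
    return True
```

The design choice worth noting is that the gate fails closed: a bad release is contained within the smallest ring that exposed the fault, rather than reaching every endpoint at once.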
If you go to the doctor to have an illness diagnosed and they send you for blood tests, it is the doctor’s responsibility to cover all possible bases and request every test that can detect the range of problems that could be causing the symptoms. In this analogy, the doctor is Crowdstrike. They have taken on the responsibility of managing their clients’ threat protection, which they “guarantee” as part of their marketing and contract negotiations. As a result, they need Operational Risk Management robust enough to match their clients’ expectations. Here, they fell some way short.
What are some main considerations in understanding how to avoid this happening again?
Primarily, the responsibility lies with Crowdstrike for making such a glaring and massive blunder, one which could and should have been easily avoided. However, as mentioned above, when humans are involved, mistakes are inevitable. All we can do is hope they are not big ones, or that they can be mitigated to such an extent that only small or inconsequential errors remain.
So we need to proceed by assuming this error, or one like it, could happen again. You can be sure that Crowdstrike is hammering its project, systems and risk managers to protect against the same issue recurring. However, there is one problem: in business and finance, we tend to have very short memories. Which is ironic, bearing in mind that all financial and business models are based on past experience. There is a high probability that within a month we will all have moved on from this and relegated it to the status of a “character-building day”. But should we be complicit in that?
It is definitely the fashion du jour to move on from what are, in reality, serious incidents – probably relieved they have passed – and just look to get back to normal. But there is a huge amount of inherent risk in this approach.
The Credit Crisis of 2008 fundamentally changed the global financial (and non-financial) marketplace. As did COVID. But have we really changed anything that will prevent these from happening again? I think not. One could say we haven’t properly learned from these mistakes, and when you consider that financial bubbles are brewing elsewhere in the market, each with their own associated risks, many of them not a million miles removed from the 2008 crisis, one could argue we definitely haven’t learned. Here I am referring to potential crunches and the end of an economic super-cycle of sorts in the world of shadow banking, but more on that another time.
In managing such risks – primarily, in this case, Operational Risk – the first step in remediation is to cast our collective memories back further than we currently do. Yes, the market is a different place, but if we take the conservative approach and recognise that we humans will ultimately cut corners to make life easier, or take advantage of specific market situations in the pursuit of profit (which borders on behavioural science), we realise that certain behaviours, and the issues they cause, are to a certain extent predictable.
One of the best examples of learning from one’s mistakes, and of a long memory when it comes to not repeating them, especially extremely public ones, is Formula 1. We still discuss Ayrton Senna’s crash and tragic death, and that was back in 1994. Niki Lauda’s accident at the Nürburgring – although we thankfully no longer see its physical ramifications on our TVs on a Sunday – is still very much in the consciousness of Formula 1 stakeholders, and that was in 1976. The public memory is far longer when it comes to how F1 manages the outcome of incidents where someone or some machine made a mistake, and what possibly sets F1 further apart is that it also learns from “near-misses”. It makes adjustments based on what went wrong, or what could have gone wrong, during an incident, and drives the chance of those errors being made again as close as possible to zero.
The example in my mind is Romain Grosjean’s crash in Bahrain. It was a combination of regulated fixes resulting from lessons learned in previous incidents that allowed him to fight and drive another day: the Halo over the cockpit that prevented him from being literally beheaded by the barrier he crashed through; the enhanced design of the monocoque driver’s tub, separated from the fuel tank in the rear of the car; the brace all drivers wear to fix their shoulders and support their necks in the cockpit; and the protection offered by his personal safety equipment – helmet, suit, gloves, boots, undersuit and so on. The combination meant he walked, or rather clambered and dived, away from a wreck that could easily have claimed his life.
This is what a more robust “corporate memory” could offer the market, and all of us users with it. It is a model we could follow when establishing minimum acceptable standards for the time horizon over which we observe and manage risks. Basically, it’s not just about the next week or month; we need to think in terms of years or even decades. Different horizons will make sense for different markets, but we should be considering a far longer time horizon than we do now.
Ah, but could people really have died, I hear you ask? Who’s to say they couldn’t. Tesla cars use computers to drive themselves; we all rely on satnav to show us where to go. Today’s modern aircraft are “fly-by-wire”, which means they use computers to manage and mimic what used to be humans pulling on handles and controls directly and mechanically connected to the flight surfaces. What if those aircraft had used a Microsoft OS and had updated automatically on the Thursday evening?
And there have been recent examples of such software causing planes to crash, but can you remember the details, who was held accountable and what fixes were made? I for one would have to Google it, but I know they exist. It seems my memory is too short in this context as well.
So how could this have been avoided?
First of all, we have already discussed the responsibility Crowdstrike bears in this situation: they must be accountable when a fix designed to make their clients’ machines more secure ends up crashing them instead. Everyone was reminded of that wonderful Microsoft “Blue Screen of Death”, which we haven’t routinely seen for some time.
The second factor is the effect BigIT has on consumer choice. A virtual monopoly can sometimes be good in terms of the pricing of the solutions offered to the market, but more often than not, with the removal or absence of comparable solutions from competitors, the potential for system redundancy is removed as well. With only one real provider, you are directly dependent on that provider’s risk management practices and skills. More than that, there is the additional risk of all markets relying on the same practices – such IT failures are often limited to one or two industries that rely on a particular solution to an industry-specific problem. Here, it was as if the whole world fell over for most of a day.
We have considered the necessity of increasing the timeframe of, and our focus on, our “corporate memories”, and the importance that has for how we manage similar risks going forward. We should be the sum of our experiences, not another iteration of our most recent two or three.
Next we will look at some of the more technical aspects of the Crowdstrike issue and what we can learn from that in order to reduce the risk of similar, high-impact situations in the future.
The outage only affected endpoints running Microsoft Operating Systems, so having a range of operating systems across the corporate collection of endpoints – or even just alternative machines running Linux or Apple operating systems available to people in Key-Person-Dependency roles – would have kept critical staff working. Who’s to say it won’t be iOS that is affected next time?
Remember here the range of endpoints we’re talking about, so this analysis must be extended to smartphones, tablets, payment terminals and so on.
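As a rough sketch of how such an analysis might start, the fragment below (with a hypothetical inventory format, role names and sample data) flags roles whose endpoints all run the same operating system and therefore have no diversity to fall back on.

```python
# Minimal sketch of an OS-concentration check over an endpoint inventory.
# The inventory format, role names and sample data are hypothetical.
from collections import defaultdict

inventory = [
    # (role, endpoint_id, operating_system) -- illustrative sample data
    ("treasury_payments", "LTP-001", "Windows"),
    ("treasury_payments", "LTP-002", "Windows"),
    ("trading_desk",      "TRD-101", "Windows"),
    ("trading_desk",      "TRD-102", "Linux"),
]

def single_os_roles(records):
    """Return roles where every endpoint runs the same OS (no redundancy)."""
    os_by_role = defaultdict(set)
    for role, _endpoint, os_name in records:
        os_by_role[role].add(os_name)
    return [role for role, systems in os_by_role.items() if len(systems) == 1]

for role in single_os_roles(inventory):
    print(f"Key-Person-Dependency risk: '{role}' has no operating-system diversity")
```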
There is a geographical play here too. The problem was first reported in Australia, so a fix was already being worked on by the start of the European and US business days. If your IT infrastructure sits in the US, especially on the west coast, there might be an opportunity to intercept or roll back the update before it becomes fully embedded.
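Where the vendor’s update channel allows client-side staging at all – an assumption on my part, since rapid-response content of this kind is often pushed automatically – that geographical head start can be formalised as deployment rings with hold windows, along these lines (region names and delays are hypothetical):

```python
# Sketch of client-side deployment rings with hold windows, assuming the
# update channel can be staged at all. Region names and delays are hypothetical.
from dataclasses import dataclass

@dataclass
class Ring:
    name: str
    regions: list[str]
    hold_hours: int  # how long to wait (and watch) before this ring installs

RINGS = [
    Ring("pilot",  ["apac_test_lab"],          hold_hours=0),
    Ring("wave_1", ["australia", "singapore"], hold_hours=4),
    Ring("wave_2", ["emea"],                   hold_hours=12),
    Ring("wave_3", ["us_east", "us_west"],     hold_hours=24),
]

def schedule(regions_reporting_faults: set[str]) -> list[str]:
    """Return the rings still cleared to install; halt everything downstream
    of the first ring containing a region that is reporting faults."""
    cleared = []
    for ring in RINGS:
        if any(region in regions_reporting_faults for region in ring.regions):
            break  # stop the rollout here; later rings never receive the update
        cleared.append(ring.name)
    return cleared

# Example: faults surface in Australia during wave_1, so only "pilot" stays cleared.
print(schedule({"australia"}))  # -> ['pilot']
```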
Thinking about infrastructure – and I am not sure how many companies do this anyway – I have certainly observed instances where a full “copy” of the live server is taken overnight and saved down, kept accessible (read-only, unless IT run a restore script for you) to the average user. It might not have worked in this instance, but being able to effectively roll your “live environment” back to the last state that worked properly within just an hour or two would be great functionality to call on in an emergency.
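Here is a minimal sketch of that idea, assuming a simple file-based environment and hypothetical paths; a real estate would typically use VM, SAN or database-level snapshots, but the principle – keep timestamped last-known-good copies and be able to restore one quickly – is the same.

```python
# Minimal sketch of nightly "last known good" snapshots with a simple restore,
# assuming a plain file-based environment and hypothetical paths.
import shutil
from datetime import datetime
from pathlib import Path

LIVE_DIR = Path("/srv/live")          # hypothetical live environment
SNAPSHOT_ROOT = Path("/srv/snapshots")

def take_snapshot() -> Path:
    """Copy the live environment into a timestamped snapshot folder."""
    target = SNAPSHOT_ROOT / datetime.now().strftime("%Y-%m-%d_%H%M")
    shutil.copytree(LIVE_DIR, target)
    return target

def restore_latest_snapshot() -> Path:
    """Replace the live environment with the most recent snapshot."""
    snapshots = sorted(SNAPSHOT_ROOT.iterdir())   # timestamped names sort chronologically
    if not snapshots:
        raise RuntimeError("no snapshots available to restore")
    latest = snapshots[-1]
    shutil.rmtree(LIVE_DIR, ignore_errors=True)
    shutil.copytree(latest, LIVE_DIR)
    return latest
```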
On a strategic level, this absolutely must be written into corporate Business Continuity Plans (BCPs), under the sections on IT and Infrastructure. If, as a company, you have clear instructions and key people identified upfront, you will manage such an outage far better in the future.
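As a rough illustration of what “identified upfront” might look like in practice, here is a hypothetical runbook entry – the roles, scenario and steps are placeholders of mine, not a template from any real BCP standard.

```python
# Rough illustration of a BCP runbook entry for an endpoint-security outage.
# Scenario, roles and steps are hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class RunbookEntry:
    scenario: str
    owner_role: str
    deputy_role: str
    first_hour_steps: list[str] = field(default_factory=list)

ENDPOINT_OUTAGE = RunbookEntry(
    scenario="Third-party endpoint-security update crashes the Windows estate",
    owner_role="Head of IT Operations",
    deputy_role="Operational Risk Manager",
    first_hour_steps=[
        "Confirm scope: which OS versions, regions and business processes are affected",
        "Halt further automatic updates where the vendor channel allows it",
        "Move Key-Person-Dependency roles onto non-Windows fallback machines",
        "Invoke the snapshot/rollback procedure for affected servers",
        "Notify regulators, clients and staff via the pre-agreed communications plan",
    ],
)
```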
Summary
In summary, there is no single way to prevent this from happening again, but with proper consideration and forward-thinking Operational Risk Management practices, it is possible to mitigate similar risks and stop this specific failure from recurring.
That, and learn from Formula 1!
If you have any questions, points or contributions, please contact me at [email protected].