Reflecting on CrowdStrike and the IT Industry - Part One

I have spent the last few days reflecting on the whole CrowdStrike incident and watching the various media feeds report on it incorrectly, using a narrative and language that at times have felt more like sensationalism verging on disinformation than a reporting of the actual truth. So, below are some of my observations and thoughts on the matter.

I have deliberately removed some of the technical details because I do not want to focus on the detail but rather on the main facts that affected many organisations around the world on Friday 19 July 2024.

CrowdStrike sells a cyber security solution that provides monitoring and protection software for an enterprise. Their solution protects endpoints, such as desktop computers and servers, and monitors network traffic and the connections made from the company network out to other networks, watching over and protecting the organisation. Their software runs on the Windows, macOS and Linux operating systems.

On 19 July, CrowdStrike issued an update to a piece of software, called Falcon Sensor, which runs on the Microsoft Windows operating system. The update contained an error, and installing it caused the Windows operating system to fail, resulting in a blue-screen-of-death (BSOD) and a non-functioning computer. The update was live for less than two hours before it was removed and made unavailable. It is thought that during that time around 9 million Windows computers installed the update and subsequently failed.

CrowdStrike is aimed at large organisations and therefore, when it broke Windows, it broke the large organisations that society relies on, such as banks, airports and health services. Core services stopped working and chaos ensued. Then, out of the chaos, the torches were lit and the witch hunt started as the world looked for someone to blame.

Fingers were pointed at CrowdStrike and their update, and rightly so. It will be interesting to see whether they ever publish what actually happened and come clean to the world. I know they have taken responsibility for the error in the update, but how did that error get through their testing processes and into a release version of the software? You have to ask the question: was it actually tested? Clearly they have some internal process issues that they are going to have to address, and I hope they eventually tell us all what actually happened. Trust relies on transparency.

Moving on from CrowdStrike, Microsoft had some focus for a while as being partly to blame because they had signed the update. Microsoft signs, that is applies a digital signature to, some software packages, particularly those that run at the kernel level, in order to verify the integrity of the software and the identity of the software publisher and to authorise the operating system to load and run it. This is not a testing process but rather a confirmation that the software has not been modified and was written by the party who claims to have written it. Despite this, there were plenty of news cycles blaming Microsoft for the chaos, though you cannot blame someone for an insecure house when you have made them give front door keys to anyone who asks. I am referring to Microsoft's Interoperability Commitment made on 16 December 2009, which was in response to competition concerns raised by the European Commission. It effectively opened the door to the operating system, allowing third-party software to run at a low level in Windows, which means that if a third-party vendor gets it wrong they can break the operating system.
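To make that distinction concrete, here is a minimal sketch in Python, using the widely available cryptography package, of what a digital signature actually proves. The key pair, the stand-in update blob and the printed messages are all hypothetical and purely for illustration; this is not Microsoft's actual driver-signing process. The point is that verification succeeds whether or not the signed content works.

```python
# Minimal illustration: a signature proves integrity and publisher identity,
# not correctness. Everything here is hypothetical and for illustration only.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# A stand-in "publisher" key pair, generated on the fly for this sketch.
publisher_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
publisher_public_key = publisher_private_key.public_key()

# Pretend this is the content of a kernel-level update file. It might be fine,
# or it might be the kind of faulty content that crashes the operating system;
# the signature cannot tell the difference.
update_blob = b"\x00" * 1024

# The publisher signs the bytes of the file.
signature = publisher_private_key.sign(update_blob, padding.PKCS1v15(), hashes.SHA256())

# The operating system (or anyone else) verifies the signature. This only
# confirms that the bytes are unmodified and came from the key holder.
publisher_public_key.verify(signature, update_blob, padding.PKCS1v15(), hashes.SHA256())
print("Signature valid: file unmodified and from the stated publisher.")
print("Nothing above tested whether the update actually works.")
```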

Now for my reflections or rants about the events that unfolded and where I feel responsibility lies.

IT Managers, System Administrators, Domain Admins: I am sorry, but I feel a lot of the responsibility for this mess lies on your shoulders, and here is why.

You are the guardians of these systems; you are responsible for everything that happens on them; that is your role. This chaos was created because nearly 9 million Windows endpoints were automatically updated within a two-hour period during the early hours of Friday morning.

Where were you when this was happening?

Where was your testing before deployment to the live system?

Where was your managed and staged roll out across the system?

Where was your contingency plan and disaster recovery plan?
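To make the staged roll-out question above concrete, here is a minimal sketch, in Python, of a ring-based deployment plan. The ring names, host lists, soak times and the deploy and health-check hooks are all hypothetical and illustrative, not a description of any particular product's tooling; but the shape of the control is the same: a small test group first, a pause to watch for problems, and a hard stop before the update ever reaches the whole estate.

```python
# Hypothetical sketch of a staged (ring-based) roll-out; names and numbers are
# illustrative only and do not describe any particular product's tooling.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Ring:
    name: str
    hosts: List[str]
    soak_hours: int  # how long to watch this ring before promoting further

ROLLOUT_PLAN = [
    Ring("test-lab", ["lab-01", "lab-02"], soak_hours=24),
    Ring("pilot-it-dept", ["it-01", "it-02", "it-03"], soak_hours=24),
    Ring("early-adopters", [f"branch-a-{i:02d}" for i in range(1, 11)], soak_hours=48),
    Ring("everyone-else", [f"hq-{i:03d}" for i in range(1, 201)], soak_hours=0),
]

def roll_out(update_id: str, health_check: Callable[[List[str]], bool]) -> None:
    """Promote an update ring by ring, halting at the first sign of trouble."""
    for ring in ROLLOUT_PLAN:
        print(f"Deploying {update_id} to {ring.name} ({len(ring.hosts)} hosts)")
        # deploy_to(ring.hosts, update_id)  # hypothetical deployment hook
        if not health_check(ring.hosts):
            print(f"Health check failed in {ring.name}: halting roll-out of {update_id}")
            return
        print(f"{ring.name} healthy; soaking for {ring.soak_hours}h before the next ring")

if __name__ == "__main__":
    # Trivial health check so the sketch runs end to end.
    roll_out("content-update-2024-07-19", health_check=lambda hosts: True)
```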

We as a profession need to reflect on these events and start taking back control and ownership of the systems we are entrusted to manage and secure.

There was a time, not so long ago, when our enterprises were running various flavours of Windows, with numerous application versions across them, and even a Windows Update had to be tested and its roll-out managed to ensure nothing broke. I am not suggesting we go back to that world, but should we be opening our systems to third parties and giving them domain admin rights and full, unrestricted access to our networks to install whatever they want?

There is a perception that we are better protecting our systems by allowing third parties to manage their security, and that we have somehow transferred the risk and responsibility to them. The trouble is, it is only a perception. The system security is still our responsibility. The system integrity is still our responsibility. The system availability is still our responsibility. When you enter the domain admin password and authorise automatic updates across your enterprise, you are taking on responsibility for the whole system and for the actions of the third party you have just opened your system to. And remember, if they get it wrong, their terms and conditions will absolve them of any liability. Their problem will be your problem.

We need to take a very critical look at our systems, our policies and our procedures, and we need to reflect, as a profession, on what we are doing and where we are heading. There will be organisations hit by this issue that are SOC 2 or ISO 27001 certified. These standards cover managing software updates and the need to test software updates before deployment. Are we actually doing what we say we should be doing?

Now let's consider the chaos again and look at it from the point of view of members of the public, say the passengers on the 5,000-plus flights that were cancelled. They are unhappy and want compensation, but companies are saying they are not entitled to it because these were "Exceptional Circumstances". In law, "Exceptional Circumstances" means circumstances that could not be reasonably foreseen and for which there was insufficient time to take the necessary action to resolve the situation arising from them.

Were they really circumstances that could not have been reasonably foreseen?

We document how we manage software updates. We document the need to test updates before deployment. ISO 27001 covers this, as does NIST SP 800-40 Rev. 4. As a profession, we certainly could reasonably foresee these circumstances occurring, and we also know how to avoid them. There will be organisations now hiding behind an "Exceptional Circumstances" clause to avoid compensation claims that have, as part of obtaining their ISO 27001 or SOC 2 certification, actually documented the need to test before deployment.

My last observation is the need to consider not only how third-party service providers, such as CrowdStrike, affect our own organisations, but also how they affect the organisations that provide third-party services to us. For example, UK NHS services were taken offline because a third-party supplier, used to run the NHS login services, had been affected: a government service reliant on how a third-party supplier manages their software updates.

We are building complex systems that we are becoming more and more dependent upon, while at the same time having less and less control over those systems and becoming more and more reliant on other organisations getting their planning and operational management right.

We are knowingly building houses of cards and we should not be surprised when they start to fall down.

Do I have any answers?

No, not really.

But the first step in solving any problem is recognising there is one, and we certainly have a problem.
