The CrowdStrike / Microsoft Chaotic Outage
John Reeman
Here are my thoughts on what transpired last Friday concerning the CrowdStrike / Microsoft global IT outage.
Like many of you, I was indirectly caught up in the crisis, seeing first hand the disruptions to retailers and airlines (thankfully my flight wasn't impacted).
Reading through many post comments this morning, I sense a mixture of anger, finger-pointing, frustration, empathy, and compassion for those fighting to get systems back online as quickly as possible. Those of you who, like me, have been in a crisis situation will know how challenging and stressful that can be. So to all CrowdStrike employees and the hard-working IT support staff in the thick of it right now: stay strong, it will get better.
Some History
What happened on Friday could have happened to anyone, but unfortunately for CrowdStrike it was at a scale that businesses globally had neither seen before nor were prepared for. In the past, for those of you who remember, we had the ILOVEYOU virus, Slammer, and more recently NotPetya. While all of these were damaging to those affected, they were not on the scale of what happened on Friday.
More than ten years ago, I wrote an article and spoke about the pervasiveness of software, and of VMware in particular, and warned then that it might not happen tomorrow or next week, but that years from now we would have a catastrophic technology disaster. As much as we can blame the vendor, and sometimes quite rightly, we collectively have to take responsibility for our own resilience and do the right thing. Vendors release patches; it's up to us to implement them as fast as we can, and in a safe way, so as not to impact the business. Equally, if we don't implement the patch, then we can hardly blame the vendor for not trying. But in that conundrum there is always a balance of risk. Do we go quickly to protect the organisation, at the risk of business disruption if the patch fails, or do we wait, at the risk of being compromised by an exploit? That decision essentially boils down to risk appetite.
The Friday event and CrowdStrike Response
What happened on Friday no one saw coming; there was no warning, it just happened, which led some to wonder whether this was the beginning of some kind of cyber war. Earlier that day Microsoft also suffered a significant outage, and while it was not on the same scale as CrowdStrike's, it was nonetheless a double whammy for the organisations affected. Shit happens, and it is refreshing to see that, after the initial marketing-speak communications from CrowdStrike, their CEO stood up today as he should, took accountability and apologised for what happened. For that transparency and leadership I commend CrowdStrike, and all organisations should take heed (in particular, Microsoft, I hope you are listening...).
Technicals
A lot of people have already commented on the technical aspects of what went wrong, and all I will say is that for software like CrowdStrike's that is so deeply embedded in the Microsoft OS tech stack (i.e. running in the kernel), Microsoft should take some responsibility for what happened on Friday. If I'm not mistaken, there should be a QA process for software that hooks into the kernel before it is allowed to be released, on both the software vendor's side and the OS vendor's side. So processes clearly failed here. The same goes for deployment: there should have been a fail-safe process, deploy to N+1 and then wait, not simply hit the button and go!
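To make that "deploy to a small group and wait" idea concrete, here is a minimal, hypothetical sketch of a staged (ring-based) rollout gate in Python. The ring names, soak time, and helper functions (deploy_to, healthy, rollback) are my own illustrative assumptions, not any vendor's actual pipeline.

```python
"""Hypothetical sketch of a ring-based ("deploy to N+1 and wait") rollout gate.
Everything here is illustrative; it is not a real vendor pipeline."""
import time

RINGS = ["internal_canary", "early_adopters", "broad_fleet"]
SOAK_SECONDS = 5  # shortened for illustration; in practice this would be hours

def deploy_to(ring: str, update_id: str) -> None:
    # Placeholder: push the update only to hosts in this ring.
    print(f"deploying {update_id} to {ring}")

def healthy(ring: str) -> bool:
    # Placeholder: inspect crash / boot-loop telemetry coming back from the ring.
    return True

def rollback(ring: str, update_id: str) -> None:
    # Placeholder: withdraw the update from a ring that has already received it.
    print(f"rolling back {update_id} from {ring}")

def rollout(update_id: str) -> bool:
    deployed = []
    for ring in RINGS:
        deploy_to(ring, update_id)
        deployed.append(ring)
        time.sleep(SOAK_SECONDS)        # soak period: never hit the whole fleet at once
        if not healthy(ring):
            for r in deployed:
                rollback(r, update_id)  # fail safe: pull the update back
            return False                # stop here; most of the fleet is untouched
    return True

if __name__ == "__main__":
    rollout("update-001")
```

The point is the gate between rings: a bad update should trip the health check in the first, small ring and never reach the broad fleet.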
Business Continuity
I see a lot of commentary on BCP and DR plans. The reality of what happened on Friday is that hardly anyone could have predicted that type of scenario, and even if they had, they would probably have dismissed it as unlikely ever to happen and therefore low risk. The impact has fallen heavily on EUC environments, and since COVID our working lives have changed: many people now work from home a few days a week or permanently, which has made remediation by already stretched IT support staff challenging for many businesses. This needs to be considered in future BCP plans too.
Eggs in one Basket
Think about your technology stack and its single points of failure, particularly where your tech is pervasive. For security, although it is a cliché, defence in depth, or as I like to look at it a layered-onion approach, is always preferable and will give you options for resilience. Not that long ago, organisations used to run multiple firewall stacks from different vendors at each layer. Maybe think about using a different vendor for AV/EDR on your server stack versus your EUC environment. That may again seem like a costly, old-school approach, but sometimes the old ways are the best!
A New Chapter
Today is Monday, and it's time to turn a new page. Because of our interconnected world, we must rewrite BCP and DR plans, conduct regular testing, and account for supply chains and SaaS platforms that are so intertwined. So, take a hard look at all of your technology and consider how pervasive it is and what interconnections it has across your lines of business. Only then can you make a more informed risk decision about what you need to do and your accepted risk tolerance levels.
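As a rough illustration of that exercise, the Python sketch below maps systems to vendors and the business lines they touch, then flags vendors whose footprint covers most of the business as candidate single points of failure. The sample data and the 75% threshold are my own assumptions, not a prescribed method.

```python
"""Rough sketch of a pervasiveness / interconnection inventory.
Sample systems and the threshold are hypothetical."""
from collections import defaultdict

systems = [
    {"name": "EDR agent",      "vendor": "VendorA", "business_lines": {"Legal", "Finance", "Ops"}},
    {"name": "Email platform", "vendor": "VendorB", "business_lines": {"Legal", "Finance", "Ops"}},
    {"name": "Billing SaaS",   "vendor": "VendorC", "business_lines": {"Finance"}},
]

def pervasive_vendors(systems, threshold=0.75):
    # Collect every line of business mentioned anywhere in the inventory.
    all_lines = set().union(*(s["business_lines"] for s in systems))
    reach = defaultdict(set)
    for s in systems:
        reach[s["vendor"]] |= s["business_lines"]
    # A vendor touching most lines of business is a candidate single point of failure.
    return {v: lines for v, lines in reach.items()
            if len(lines) / len(all_lines) >= threshold}

if __name__ == "__main__":
    for vendor, lines in pervasive_vendors(systems).items():
        print(f"{vendor} reaches {sorted(lines)} - review resilience options")
```

Even a simple inventory like this makes the risk conversation concrete: you can see which vendors your whole business leans on before deciding what level of disruption you are willing to accept.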
Reach out if you need any help or guidance; in particular, for small to medium-sized businesses that may be suffering, I'm offering my services pro bono (DM me here on LinkedIn).