Preparing for the crisis you can't prevent

For me, the CrowdStrike catastrophe of July 2024 fits a pattern of recent outages caused by failures within third parties that provide centralized IT services to the market (e.g. the Optus outage and Google's accidental deletion of UniSuper's data).

The uncomfortable truth about these outages is that, for practical purposes, they cannot be predicted or prevented by CIOs in client organisations.

Sure, those of us with a technical bent might enjoy academic speculation about each specific issue ("what if we used multiple EDR providers?", "what if Microsoft banned kernel access?", etc.), and naturally the third-party tech providers must and will learn and improve from each mistake. But a realistic view is that we will continue to experience these random "black swan" events in a technological landscape that is increasingly concentrated and centrally controlled, particularly as our industry is still fairly immature compared with other engineering disciplines.


So what's a CIO to do? Should we forget about cloud and off-the-shelf software and go back to the days of writing our own applications and running them on our own server farm in the basement? No. In most cases the benefits outweigh the risks, and most "big tech" providers are better resourced with more mature processes than our in-house IT teams, even though they are clearly not infallible.

Instead, the answer (unglamorous as it seems) is crisis planning with a focus on business continuity: assume that a major IT outage will occur and work out how you will respond, and particularly how your organisation will keep its critical business functions running with manual processes. How will you ship products? Communicate with customers? Pay and collect money? Inform partners and stakeholders? Do your front-line staff know how to operate (albeit inefficiently) with pen, paper and calculator?

From a client organisation perspective, many of the widespread tech outages that I'm discussing here have very similar impacts to a major cyber attack - all of your IT systems are shut down for an extended period of time - so this type of crisis planning is valuable for both scenarios. It's also one of the best "bang-for-buck" risk controls you can apply, because there is usually no technology to purchase; the investment is in the time of key SMEs and business leaders.


If this is such a no-brainer, why do so few organisations actually do it? Many of the organisations I've worked with either don't have a Business Continuity Plan (BCP), or have a "token" BCP that is not fit-for-purpose. For example, "we will cut across to our test instance" or "we will recover all systems from backup within 2 hours" is not a BCP, it's an IT Disaster Recovery Plan (and not a very realistic one in many cases). They are different, and both are required.

In my experience there are two key obstacles to overcome to get an organisation on a path towards effective planning for an IT crisis:

  1. Disaster myopia - it's human nature to avoid thinking that "it will happen to us" and instead focus on immediate pressures. Unless the organisation has a corporate memory of living through such a crisis, or has a strong and influential risk management function, it can be hard to gain top-down support.
  2. "It's the CIO's problem" - IT teams should be planning for how they can best recover from IT outages, but it is not their job to plan how business processes can run in the absence of IT. This is the responsibility of the whole ELT (and especially the COO and CFO), overseen by the Board/ARC.

Here's a good example of the right mindset, from the CEO of UniSuper: https://www.investmentmagazine.com.au/2024/06/an-implausible-planning-scenario-inside-the-unisuper-member-services-outage/

I urge organisations to give priority to creating an effective BCP, accepting that the next IT crisis is just a matter of time. Of course, I also think that it's worthwhile to engage an outside expert (with lived experience of crises, not theoretical knowledge) to guide and challenge your thinking process.

Danielle Sandrazie

People Leader | IT Manager | Mentor | Strategic Technology Roadmap | Program Manager | Program Lead - Women 4 STEM (What’s Hot in Tech)

3 weeks ago

Great insight, thanks for sharing Ellis.

Esrael Maru BA,MA,CPA,IIA

Chief Audit Executive (CAE) at Toyota Motor Corporation Australia Ltd

3 months ago

Many thanks Ellis for sharing your great insights. The inevitability of these "black swan events" really underscores the need for robust crisis planning and business continuity strategies. It's a sobering reminder that while we can't prevent every outage, we can certainly prepare for them.

Boris Petukhov

Management Consultant | Doctorate in Project Management | SFIA Level 7 Project Manager | ISACA CISM | ISACA CRISC

4 months ago

Hope for the best but prepare for the worst.

Cathy Thomas

EGM Technology, Analytics and Business Improvement leading strategic improvement and transformation.

4 months ago

Great advice Ellis and while it seems "unglamorous" having a robust and tested BCP is critical to ensuring you can keep serving or communicating with your customers in the event of an unplanned outage of your IT systems.

John Schumacher

Delivery Executive

4 months ago

Always enjoy your insights, Ellis.
