Do You Have A Resilience Strategy?
Nick Drage
A practitioner of game-based methods to help you make more impactful decisions.
In light of recent outages in cloud services, such as AWS and Insteon, it’s time to plan how your organisation would react to an online service disappearing.
This thinking builds on this previous article by Indy Neogy and me.
Using examples as spurs to action
When deciding on what threats to your organisation to investigate, and therefore what remediation strategies to prioritise, there are three methods that can inform your decision:
Because that third method is the best combination of accuracy and ease of use, we’ll use it to examine three noteworthy incidents affecting essential online services, highlighting where your organisation might have a forthcoming problem.
Atlassian Accidentally Deletes Customers' Sites - On April 5th 2022 a routine maintenance script deleted the sites of hundreds of Atlassian’s customers. Atlassian services include Jira Software, Jira Work Management, Jira Service Management, Confluence, Opsgenie, Statuspage, and Atlassian Access. Atlassian’s estimates of the time required to resolve the issue, and of the number of customers affected, changed significantly during the incident. Atlassian are phasing out licenses for the on-premises alternative to their online services.
They have provided a full Post Incident Review on their site, while this piece by Network World provides a much shorter summary of the key learnings.
The outage affected a very small percentage of their customers, but those organisations lost weeks of availability waiting for systems to be restored. As Atlassian products such as Jira tend to be the operational core of how an organisation prioritises and executes software related tasks, this outage must have required a significant reworking of processes.
OVH Datacentre Fire - The second incident is from March 2021, when the cloud service provider OVH experienced a fire in one of its datacentres. The SBG2 datacentre in Strasbourg was burnt down, with the incident also affecting the neighbouring datacentres of SBG1, SBG3, and SBG4.
Some reports state that thousands of websites were rendered unavailable; others, such as this article by The Register, put the figure in the millions. And, as stated by Gartner senior analyst Tiny Haynes: “for a fire to destroy an entire data centre would raise questions around the operational effectiveness of the fire detection and suppression systems in place.” Further questions are listed at the bottom of this article, with only partial answers in this article by the same industry organisation. Do go to this link for an expert analysis of the fire.
Your online service provider may not have the resilience that you assume. Ensure that your organisation’s resilience is not dependent on the assumed resilience of supporting suppliers.
Insteon Disappears - The third incident is noteworthy because of how quickly the situation changed. Insteon’s service (motto: “We’re keeping the lights on!”) was used by homeowners to manage their home automation products. Then the service simply vanished: all of its cloud services stopped working without any warning, and the company didn’t answer any queries. Notably, the CEO removed their Insteon role from their LinkedIn profile before deleting the profile entirely.
After a week a relatively non-committal message was placed on the company website; there are further details from Ars Technica. The company appears to have had some warning of its impending failure, but chose not to inform customers and simply turned off everything those customers depended on. While this specific service is unlikely to be used by organisations, it is a real example of a provider simply stopping a service, and therefore of the need to be able to react to such sudden changes.
Strategic Responses
Your COM, or Current Operating Model, assumes that everything will stay as it is. If an online service fails, in that it is no longer able to provide the service you need, then suddenly your COM has become a Target Operating Model: where you’d like to be. The key difference here, as opposed to more traditional failure states associated with on-premises equipment, is that your organisation has no influence over the restoration of the service, only over its replacement.
Faced with such situations, consider these strategies, and the deciding factors when selecting which strategy, or combination of strategies, to use:
Pre-established Fallback position - Look at the services that your organisation depends on for business as usual and ensure you have an alternative option available if a service temporarily or permanently fails. Ideally, you have a complete inventory of all the services your organisation uses. But that asset enumeration can be such an arduous task that the project never gets started, or this first step never gets finished. It is best to either pick a very limited scope, or arbitrarily decide that you have discovered sufficient services to continue. In this case doing the right thing with limited information is much better than not doing anything at all.
From there, work through those services methodically. Do you start with the services that would have the greatest impact if they failed? The services that are most likely to fail? The services you already know about? Or at random? For many organisations it does not matter which they choose. It is more important to pick a methodology and start than it is to spend resources agreeing on the ideal method.
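As a rough illustration of how little structure is needed to get started, here is a minimal sketch, in Python, of a service inventory ranked by a simple impact-times-likelihood score; the service names, scores, and fields are hypothetical placeholders rather than recommendations.

```python
# Minimal sketch: rank the services you know about so the work can start somewhere.
# All entries and scores below are hypothetical placeholders.

services = [
    {"name": "issue-tracker",   "impact": 5, "likelihood": 2, "fallback": None},
    {"name": "cloud-hosting",   "impact": 5, "likelihood": 1, "fallback": "secondary-region"},
    {"name": "home-automation", "impact": 1, "likelihood": 4, "fallback": None},
]

def priority(service: dict) -> int:
    """Crude ordering: impact times likelihood, highest first."""
    return service["impact"] * service["likelihood"]

for service in sorted(services, key=priority, reverse=True):
    gap = "no fallback yet" if service["fallback"] is None else f"fallback: {service['fallback']}"
    print(f"{service['name']:<16} score={priority(service)}  ({gap})")
```

The scoring itself matters far less than the fact that it produces an ordered list to start working through.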
Determine your “relative failure position” - some online services are so ubiquitous that if they fail, it seems as though every service has broken. If you’re using a service that everyone else is using, what’s the relative impact of that service being unavailable? As shown by failures in services such as Amazon Web Services in 2021 and Cloudflare in 2020, if all your competitors are affected, and the outage is short term, the relative impact may be zero. Similarly, if all the organisations you work with are affected, your operations will probably be idle while you wait for those affected to return to work, and any contingency you have in place will be unnecessary.
Obtain “failure intelligence” - It is common for companies to purchase “threat intelligence” on the cyber security threats that may be directed against their organisation, or “vulnerability intelligence” on the weaknesses within their infrastructure/estate. Those two terms can be poorly defined and used interchangeably, when they should not be - but “failure intelligence” is even more nebulous. For either internal or external providers, ensure that the services provided are clearly defined within your contracts. Also ask how the service provider operates and what risks they are exposed to. Are they due to be taken over? Is a significant part of the company within a politically unstable region? And so on. With this information you can either prioritise a fallback position for those service providers with the highest risk, or pre-emptively move to a lower risk service provider in advance of an expected failure.
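To make “failure intelligence” slightly more concrete, here is a minimal sketch of recording answers to that kind of due-diligence question per provider and flagging the riskiest ones; the provider names and risk factors are hypothetical examples, not a vetted checklist.

```python
# Minimal sketch: track simple "failure intelligence" answers per provider and
# flag the ones that most warrant a fallback plan. All data is hypothetical.

providers = {
    "example-saas-vendor": {
        "acquisition_pending": True,          # due to be taken over?
        "politically_unstable_region": False, # significant operations in an unstable region?
        "single_datacentre": True,            # dependent on one physical site?
    },
    "example-hosting-provider": {
        "acquisition_pending": False,
        "politically_unstable_region": True,
        "single_datacentre": False,
    },
}

def risk_flags(answers: dict) -> list[str]:
    """Return the names of the risk factors that apply to a provider."""
    return [factor for factor, applies in answers.items() if applies]

for name, answers in sorted(providers.items(), key=lambda kv: -len(risk_flags(kv[1]))):
    flags = risk_flags(answers)
    print(f"{name}: {len(flags)} risk flag(s) -> {', '.join(flags) or 'none'}")
```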
Specific Plans versus Generic Capabilities - do you set up an increasingly complex “if X happens we do Y” matrix and accompanying runbooks, or do you develop a set of stratagems and capabilities to call on whenever anything in this general class of problem happens? It can feel more reassuring to do the former, but people tend to underestimate how quickly the complexity of making such specific plans, and keeping them up to date, grows. Instead, ask: does your organisation have enough capability to react to an unexpected cloud service provider (CSP) failure, and are staff operating with enough “headroom” to put in the extra time to find a new supplier without affecting business operations?
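As an illustration of the trade-off, here is a minimal sketch contrasting a specific scenario-to-runbook matrix with a single generic response; the scenario names and runbook text are hypothetical.

```python
# Minimal sketch of the trade-off: a specific scenario->runbook matrix versus a
# single generic capability. Scenario names and runbooks are hypothetical.

specific_runbooks = {
    ("issue-tracker", "provider outage"): "switch to spreadsheet triage",
    ("issue-tracker", "provider shutdown"): "migrate to self-hosted alternative",
    ("cloud-hosting", "region failure"): "fail over to secondary region",
    # Every new service and every new failure mode adds another entry to maintain.
}

def generic_response(service: str, failure: str) -> str:
    """Fallback used when no specific plan exists: a generic capability."""
    return f"assemble response team, assess impact of {failure} on {service}, source replacement"

def respond(service: str, failure: str) -> str:
    # Prefer a specific plan when one exists and is current; otherwise rely on headroom.
    return specific_runbooks.get((service, failure), generic_response(service, failure))

print(respond("issue-tracker", "provider outage"))
print(respond("payroll-saas", "provider shutdown"))  # no specific plan -> generic capability
```

The matrix gives precise answers for the scenarios it covers, but only the generic capability covers the scenarios nobody thought to write down.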
Overall, it’s important to remember that a combination of strategies can be valid. Commonly, an organisation will decide that a few of the services it uses are so ubiquitous that the relative impact of their failure is small, a few are so important that pre-agreed and pre-planned fallback positions are warranted, and for everything else they maintain sufficient overhead to deal with any unexpected failures.
Practical steps
Faced with having to choose between the different strategies presented above, and to decide which services to apply them to, the problem of where and how to start still applies. We recommend trying a “Minimum Viable Product” of each of the following, and seeing how they work for you and for your organisation.
( And if you’re having trouble with any of these, please do contact us to see how we can help. )
Adversarial analysis - have staff think of specific services they and their colleagues depend on, and the ways in which those services could fail. The best approach for this exercise is to send your staff challenging questions to help them think through their knowledge: for example, ask them which services are single points of failure, which feel most likely to fail, and which service failure would be the most damaging, amusing, or surprising.
Wargame the situations - pick a particular failure scenario, and have staff - representing themselves or their departments - step through how they would have to act and react within that situation. Have people formally state how they would react, and if necessary prove that their reaction is possible. This exercise can be useful for highlighting that contacting services, or obtaining the permission to replace them, isn’t as easy as employees assume. This can be a structured meeting, or it can use more formal matrix game argument formats. As the organisation becomes more advanced or experienced in this practice, the exercise can become more mechanical, with the resources allocated to each issue tracked against the resources needed by normal business operations.
Please note - while there is no active opponent when exploring these scenarios, the term “wargame” applies because the outcome of the exercise is unknown, and is determined by the participants and the organiser as it is played. You may or may not find the term “wargame” suitable in your organisation; use whatever synonym works for you and the significant stakeholders.
Service Bill of Materials - Just start building a “bill of materials”, a central index of what services are used within your organisation. This can be a great way to build links with other silos, as well as discovering any existing repositories of this information. You may discover other formal or informal asset managers within the organisation, and build on each others’ efforts.
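As one possible starting point, here is a minimal sketch of what such an index could record per service; the field names and the single example entry are hypothetical, not a prescribed schema.

```python
# Minimal sketch of a "service bill of materials" entry. Field names and the
# example record are hypothetical; record whatever your organisation can discover.
from dataclasses import dataclass, field

@dataclass
class ServiceRecord:
    name: str                    # what the service is called internally
    provider: str                # who supplies it
    purpose: str                 # what business activity depends on it
    internal_owner: str          # who to ask about it
    fallback: str | None = None  # agreed alternative, if any
    notes: list[str] = field(default_factory=list)

inventory = [
    ServiceRecord(
        name="issue tracking",
        provider="Example SaaS Vendor",
        purpose="prioritising and tracking software work",
        internal_owner="engineering operations",
        fallback=None,
        notes=["no data export tested yet"],
    ),
]

for record in inventory:
    status = record.fallback or "no fallback recorded"
    print(f"{record.name} ({record.provider}) - owner: {record.internal_owner}; fallback: {status}")
```

Even a spreadsheet with these few columns is enough to start the conversations with other teams.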
Summary
The most important decision here is the decision to start: to do something, and to iterate from there depending on success and feedback. Being prepared for this kind of failure from service providers is a complicated subject, and a large and never-ending task. But it's crucial to bear in mind that any effort put into this process is rewarded, compared to not being prepared at all for service outages, infrastructure fires, or your lights going out. Using the three practical steps listed above, start your process today.
( Thank you to Sarah Ramsey, Michael Shafer, Tom White, and Russell Smith of the Foster community for their feedback and help with this piece )