Do You Have A Resilience Strategy?
Nick Drage
A practitioner of game-based methods to help you make more impactful decisions.
In light of recent outages in cloud services, such as AWS and Insteon, it’s time to plan how your organisation would react to an online service disappearing.
This thinking builds on this previous article by Indy Neogy and me.
Using examples as spurs to action
When deciding on what threats to your organisation to investigate, and therefore what remediation strategies to prioritise, there are three methods that can inform your decision:
Because that third method is the best combination of accuracy and ease of use, we’ll use it to examine three noteworthy incidents affecting essential online services, highlighting where your organisation might have a forthcoming problem.
Atlassian Accidentally Deletes Customers' Sites - On April 5th 2022 a routine maintenance script deleted the sites of hundreds of Atlassian’s customers. Atlassian services include Jira Software, Jira Work Management, Jira Service Management, Confluence, Opsgenie, Statuspage, and Atlassian Access. Atlassian’s estimates of the time required to resolve the issue, and of the number of customers affected, changed significantly during the incident. Atlassian are phasing out licenses for the on-premises alternative to their online services.
They have provided a full Post Incident Review on their site, while this piece by Network World provides a much shorter summary of the key learnings.
The outage affected a very small percentage of their customers, but those organisations lost weeks of availability waiting for systems to be restored. As Atlassian products such as Jira tend to be the operational core of how an organisation prioritises and executes software related tasks, this outage must have required a significant reworking of processes.
OVH Datacentre Fire - The second incident is from March 2021, when the cloud service provider OVH experienced a fire in one of its datacentres. The SBG2 datacentre in Strasbourg was burnt down, with the incident also affecting the neighbouring datacentres of SBG1, SBG3, and SBG4.
Some reports state that thousands of websites were rendered unavailable; others, such as this article by The Register, put the figure in the millions. And, as stated by Gartner senior analyst Tiny Haynes: “for a fire to destroy an entire data centre would raise questions around the operational effectiveness of the fire detection and suppression systems in place.” Further questions are listed at the bottom of this article, with only partial answers in this article by the same industry organisation. Do go to this link for an expert analysis of the fire.
Your online service provider may not have the resilience that you assume. Ensure that your organisation’s resilience is not dependent on the assumed resilience of supporting suppliers.
Insteon Disappears - The third incident is noteworthy because of how quickly the situation changed. Insteon’s service (motto: “We’re keeping the lights on!”) was used by homeowners to manage their home automation products. Then the service simply vanished: all of its cloud services stopped working without any warning, and the company didn’t answer any queries. Notably, the CEO removed their Insteon role from their LinkedIn profile before deleting the profile entirely.
After a week a relatively non-committal message was placed on the company website; there are further details from Ars Technica. The company appears to have had some warning of its impending failure, but chose not to inform customers and simply turned off everything those customers depended on. While this specific service is unlikely to be used by organisations, it is a real example of a provider simply stopping a service, and therefore of the need to be able to react to such sudden changes.
Strategic Responses
Your COM, or Current Operating Model, assumes that everything will stay as it is. If an online service fails, in that it is no longer able to provide the service you need, then suddenly your COM has become a Target Operating Model: where you’d like to be. The key difference here, as opposed to more traditional failure states associated with on-premises equipment, is that your organisation has no influence over the restoration of the service, only over its replacement.
Faced with such situations, consider these strategies, and the deciding factors when selecting which strategy, or combination of strategies, to use:
Pre-established Fallback position - Look at the services that your organisation depends on for business as usual and ensure you have an alternative option available if a service temporarily or permanently fails. Ideally, you have a complete inventory of all the services your organisation uses. But that asset enumeration can be such an arduous task that the project never gets started, or this first step never gets finished. It is best to either pick a very limited scope, or arbitrarily decide that you have discovered sufficient services to continue. In this case doing the right thing with limited information is much better than not doing anything at all.
From there, work through those services methodically. Do you start with the services that would have the greatest impact if they failed? The services that are most likely to fail? The services you already know about? Or at random? For many organisations it does not matter which they choose. It is more important to pick a methodology and start than it is to spend resources agreeing on the ideal method.
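As a rough illustration of how little structure is needed to get started, here is a minimal sketch, in Python, of a service inventory ranked by a simple impact-times-likelihood score; the service names, scores, and fields are hypothetical placeholders rather than recommendations.

```python
# Minimal sketch: rank the services you know about so the work can start somewhere.
# All entries and scores below are hypothetical placeholders.

services = [
    {"name": "issue-tracker",   "impact": 5, "likelihood": 2, "fallback": None},
    {"name": "cloud-hosting",   "impact": 5, "likelihood": 1, "fallback": "secondary-region"},
    {"name": "home-automation", "impact": 1, "likelihood": 4, "fallback": None},
]

def priority(service: dict) -> int:
    """Crude ordering: impact times likelihood, highest first."""
    return service["impact"] * service["likelihood"]

for service in sorted(services, key=priority, reverse=True):
    gap = "no fallback yet" if service["fallback"] is None else f"fallback: {service['fallback']}"
    print(f"{service['name']:<16} score={priority(service)}  ({gap})")
```

The scoring itself matters far less than the fact that it produces an ordered list to start working through.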
Determine your “relative failure position” - some online services are so ubiquitous that if they fail, it seems as though every service has broken. If you’re using a service that everyone else is using, what’s the relative impact of that service being unavailable? As shown by failures in services such as Amazon Web Services in 2021 and Cloudflare in 2020, if all your competitors are affected, and the outage is short term, the relative impact may be zero. Similarly, if all the organisations you work with are affected, your operations will probably be idle while you wait for those affected to return to work, and any contingency you have in place will be unnecessary.
Obtain “failure intelligence” - It is common for companies to purchase “threat intelligence” on the cyber security threats that may be directed against their organisation, or “vulnerability intelligence” on the weaknesses within their infrastructure/estate. Those two terms can be poorly defined and used interchangeably, when they should not be - but “failure intelligence” is even more nebulous. For either internal or external providers, ensure that the services provided are clearly defined within your contracts. Also ask how the service provider operates and what risks they are exposed to. Are they due to be taken over? Is a significant part of the company within a politically unstable region? And so on. With this information you can either prioritise a fallback position for those service providers with the highest risk, or pre-emptively move to a lower risk service provider in advance of an expected failure.
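To make “failure intelligence” slightly more concrete, here is a minimal sketch of recording answers to that kind of due-diligence question per provider and flagging the riskiest ones; the provider names and risk factors are hypothetical examples, not a vetted checklist.

```python
# Minimal sketch: track simple "failure intelligence" answers per provider and
# flag the ones that most warrant a fallback plan. All data is hypothetical.

providers = {
    "example-saas-vendor": {
        "acquisition_pending": True,          # due to be taken over?
        "politically_unstable_region": False, # significant operations in an unstable region?
        "single_datacentre": True,            # dependent on one physical site?
    },
    "example-hosting-provider": {
        "acquisition_pending": False,
        "politically_unstable_region": True,
        "single_datacentre": False,
    },
}

def risk_flags(answers: dict) -> list[str]:
    """Return the names of the risk factors that apply to a provider."""
    return [factor for factor, applies in answers.items() if applies]

for name, answers in sorted(providers.items(), key=lambda kv: -len(risk_flags(kv[1]))):
    flags = risk_flags(answers)
    print(f"{name}: {len(flags)} risk flag(s) -> {', '.join(flags) or 'none'}")
```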
Specific Plans versus Generic Capabilities - do you set up an increasingly complex “if X happens we do Y” matrix and accompanying runbooks, or do you develop a set of stratagems and capabilities to call on whenever anything in this general class of problem happens? It can feel more reassuring to do the former, but people tend to underestimate how quickly the complexity of making such specific plans, and keeping them up to date, grows. Instead, ask: does your organisation have enough capability to react to an unexpected cloud service provider (CSP) failure, and are staff operating with enough “headroom” to put in the extra time to find a new supplier without affecting business operations?
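As an illustration of the trade-off, here is a minimal sketch contrasting a specific scenario-to-runbook matrix with a single generic response; the scenario names and runbook text are hypothetical.

```python
# Minimal sketch of the trade-off: a specific scenario->runbook matrix versus a
# single generic capability. Scenario names and runbooks are hypothetical.

specific_runbooks = {
    ("issue-tracker", "provider outage"): "switch to spreadsheet triage",
    ("issue-tracker", "provider shutdown"): "migrate to self-hosted alternative",
    ("cloud-hosting", "region failure"): "fail over to secondary region",
    # Every new service and every new failure mode adds another entry to maintain.
}

def generic_response(service: str, failure: str) -> str:
    """Fallback used when no specific plan exists: a generic capability."""
    return f"assemble response team, assess impact of {failure} on {service}, source replacement"

def respond(service: str, failure: str) -> str:
    # Prefer a specific plan when one exists and is current; otherwise rely on headroom.
    return specific_runbooks.get((service, failure), generic_response(service, failure))

print(respond("issue-tracker", "provider outage"))
print(respond("payroll-saas", "provider shutdown"))  # no specific plan -> generic capability
```

The matrix gives precise answers for the scenarios it covers, but only the generic capability covers the scenarios nobody thought to write down.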
Overall, it’s important to remember that a combination of strategies can be valid. Commonly, an organisation will decide that a few of the services it uses are so ubiquitous that the relative impact of their failure is small, a few are so important that pre-agreed and pre-planned fallback positions are warranted, and for everything else they maintain sufficient overhead to deal with any unexpected failures.
Practical steps
Faced with having to choose between the different strategies presented above, and to decide which services to apply them to, the problem of where and how to start still applies. We recommend trying a “Minimum Viable Product” of each of the following, and seeing how they work for you and for your organisation.
( And if you’re having trouble with any of these, please do contact us to see how we can help. )
Adversarial analysis - have staff think of specific services they and their colleagues depend on, and the ways in which those services could fail. The best approach for this exercise is to send your staff challenging questions to help them think through their knowledge: for example, ask them which services are single points of failure, which feel most likely to fail, and which service failure would be the most damaging, amusing, or surprising.
Wargame the situations - pick a particular failure scenario, and have staff - representing themselves or their departments - step through how they would have to act and react within that situation. Have people formally state how they would react, and if necessary prove that their reaction is possible. This exercise can be useful for highlighting that contacting services, or obtaining the permission to replace them, isn’t as easy as employees assume. This can be a structured meeting, or it can use more formal matrix game argument formats. As the organisation becomes more advanced or experienced in this practice, the exercise can become more mechanical, with the resources allocated to each issue tracked against the resources needed by normal business operations.
Please note - while there is no active opponent when exploring these scenarios, the term “wargame” applies because the outcome of the exercise is unknown, and is determined by the participants and the organiser as it is played. You may or may not find the term “wargame” suitable in your organisation; use whatever synonym works for you and the significant stakeholders.
Service Bill of Materials - Just start building a “bill of materials”, a central index of what services are used within your organisation. This can be a great way to build links with other silos, as well as discovering any existing repositories of this information. You may discover other formal or informal asset managers within the organisation, and build on each others’ efforts.
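As one possible starting point, here is a minimal sketch of what such an index could record per service; the field names and the single example entry are hypothetical, not a prescribed schema.

```python
# Minimal sketch of a "service bill of materials" entry. Field names and the
# example record are hypothetical; record whatever your organisation can discover.
from dataclasses import dataclass, field

@dataclass
class ServiceRecord:
    name: str                    # what the service is called internally
    provider: str                # who supplies it
    purpose: str                 # what business activity depends on it
    internal_owner: str          # who to ask about it
    fallback: str | None = None  # agreed alternative, if any
    notes: list[str] = field(default_factory=list)

inventory = [
    ServiceRecord(
        name="issue tracking",
        provider="Example SaaS Vendor",
        purpose="prioritising and tracking software work",
        internal_owner="engineering operations",
        fallback=None,
        notes=["no data export tested yet"],
    ),
]

for record in inventory:
    status = record.fallback or "no fallback recorded"
    print(f"{record.name} ({record.provider}) - owner: {record.internal_owner}; fallback: {status}")
```

Even a spreadsheet with these few columns is enough to start the conversations with other teams.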
Summary
The most important decision here is the decision to start: to do something, and to iterate from there depending on success and feedback. Being prepared for this kind of failure from service providers is a complicated subject, and a large and never-ending task. But it's crucial to bear in mind that any effort put into this process is rewarded, compared to not being prepared at all for service outages, infrastructure fires, or your lights going out. Using the three practical steps listed above, start your process today.
( Thank you to Sarah Ramsey, Michael Shafer, Tom White, and Russell Smith of the Foster community for their feedback and help with this piece )