Practise your Disasters
We’ve all had a disaster or two, both professionally and in life, I’m sure.
I’m no good at the life-counselling thing, and whipping out a Disaster Recovery (DR) plan and a Business Continuity Plan (BCP) when Daughter #1 splits up with boyfriend #4 doesn’t work. Fact. Though having a stash of Pringles handy and a forced re-watch of National Lampoon’s is a good plan to have.
But as with all good plans, a bit of planning before you have a plan is a good plan, and that’s what this article is about: the basic interactions between three aspects of these plans, and how you can enact them for your team or business.
Disaster is inevitable in some form or other. Who’d have thought we would face the disaster of medium-to-long-term limited office access (how’s the printing going?). And more typically, if you’re moving office, you can treat the lack of access as a short-term loss of access.
I mostly treat DR and BCP within the same structure, as they very much work together. Simply put, if your ICT team is experiencing a disaster, it’s likely the rest of the organisation isn’t getting on with business; no continuity, as it were.
And this is the lead-in to the first point: where does all this sit? I’ve found that DR sits largely in the ICT domain, simply because there needs to be structure in place to ensure a return to service in an orderly fashion, coupled with great communications. Business Continuity sits, logically, with each business unit, to ensure they’re able to keep working. Scenarios are really an outcome of the BCP process and serve to feed the DR planning.
It all looks a bit like this:
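In rough text form (my sketch of the flow just described):

Business Continuity Planning (each business unit)
    → identifies fail situations and mitigations
        → which become Scenarios (what can go wrong, and what to do about it)
            → which feed the Disaster Recovery plan (the ICT-led return to service)
                → which you practise, feeding lessons back into the BCP.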
Disaster Recovery
A trap for many is to put in place a DR plan for, say, flooding of the datacentre. That’s great, but take it up to the top level: what you actually have is a loss of services, likely critical ones, and that is what needs to be addressed.
Using a table like the one sketched below (the key is the arrows), you can work through with your ICT team what’ll happen and when. And you only need one of these tables to cover all your critical systems. I did one for critical and one for non-critical, and that was our DR largely done.
The way it works is to have your timeline on the rows, with whatever spacing works for you. I found that bands like 0-5 mins, 5-15 mins, first half hour, and within 1 hour worked for those involved and clearly articulated to the audience what is happening when.
The second axis is the who. I include monitoring, which really is what’ll tell you that you have an issue in the first place. From there it runs through the doing teams to senior leadership. You can of course include external parties in this model too.
Within the sheet itself is the great bit: you can cover what is happening when, and with whom. It’s likely your ICT team will want to investigate for some time before escalating and switching to an alternate backup solution; after all, the issue may lie in configuration or connectivity (both of which could carry the failure across to the backup).
And while the cost of switching solutions in a disaster scenario is one thing, the effort to switch back afterwards can be just as hard.
You end up with a relatively simple spreadsheet that outlines who’s doing what to recover, if not the outright ‘press this, push that’ commentary (that comes later).
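As a sketch, using the time bands above (the parties and entries here are invented for illustration; the arrows mark hand-offs and escalation):

Time            | Monitoring     | ICT team                                | Senior leadership
0-5 mins        | Alert raised → | Acknowledge and triage                  |
5-15 mins       |                | Investigate configuration/connectivity |
First half hour |                | Fix in place, or prepare to switch →    | Briefed, comms drafted
Within 1 hour   |                | Switch to the backup solution →         | Stakeholder comms sent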
Business Continuity Planning
When I was asked to do this some years back, I was concerned. I knew the ICT side of it, but that’s not the whole service chain, just one aspect of it.
And so I constructed a number of high-level fail situations (no site access, no database, etc.) and asked business experts to think about what they would do to continue working through each one.
Things I asked them to think through included:
- Impact on the business (Service/Financial/Anything else)
- What mitigating actions were appropriate
- Changes required to enable this and the target date to deliver (always have a date!)
The resulting spreadsheet was a simple matrix of fail situations against those answers.
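A cut-down sketch of the shape, with invented entries for illustration:

Fail situation          | Impact (service/financial)     | Mitigating action                   | Change required (target date!)
No site access, 1 day   | Calls unanswered; SLA risk     | Divert phones; work from home       | Softphone rollout
No database, half a day | Orders can’t be processed      | Take orders on paper, re-key later  | Printed form stock on hand
No email, over a day    | Customers can’t reach the team | Publish an alternate contact method | Standby Gmail account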
We went through everything from one staff member being unavailable for less than half a day through to the exec team disappearing for good, and from simple system issues through to full and final failure. All in a simple matrix.
For each business unit I tend to say allow 1-2 hours for the interview, and at the end of it you’ll have a picture of what BCP is already in place and what’s needed to ensure continuity in the event of a disaster.
A good example was a team that couldn’t live without email for any real length of time. So when email went down for over a day, the BCP was to communicate an alternate contact method (plausible in the systems we had). The action? Set up a Gmail account and have it on standby.
There’s a lot more to the structuring of this, but enough here, I think.
Scenarios
Once you have some high-level BCP situations covered and a DR plan in place, you can look at how to recover from specific situations faster.
Earlier I mentioned no access to the office. This is a BCP issue and should have mitigations in place, but it can be treated as a disaster, and probably should be; knowing what to do in these scenarios certainly helps recovery.
I have a template for these scenario documents.
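As a sketch, its sections run along these lines (indicative headings rather than a definitive list):

- Scenario description and trigger
- How it’s detected and declared
- Impact and affected services
- Recovery steps, owners and timings
- Communications: who’s told, when and by whom
- Return-to-normal criteria and switch-back steps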
All fine sections, I hope you agree. There’s a fair bit in each scenario document, but the effort here pays off when the "D" hits and you need to "R" as fast and as safely as possible.
To ensure you have the scenarios covered, build a list of what can go wrong and where: perhaps not ‘flooded datacentre’, but certainly ‘loss of key database server’ or similar.
And this is where the title of this article comes in: practise your disasters.
With the above three components in place, you can roleplay an outage, either by running through the paperwork or, as I did, by fully experiencing the journey with the MD…
We decided to roleplay a critical outage while releasing an update to a beta test server where essentially everything lined up with the live environment. The resulting journey through the DR plan and engagement with teams became all too real!
I strongly recommend this approach; it was equal parts scary and enlightening to have senior management get into the spirit of treating it like a full live event!
Completing the testing means you can refine your DR plan, improve your BCP and review your scenarios as you go; the cycle is complete.
Happy to discuss anything in this article as always. Drop me a line.
This article is my views based on experiences, training and observations of far too long in the technology arena. Your views, experiences and opinions are yours, valuable and equally pertinent. They’re just not mine and it’s easier to write about the stuff I know!