Practise your Disasters
We’ve all had a disaster or two, both professionally and in life, I’m sure.
I’m no good at the life-counselling thing, and whipping out a Disaster Recovery (DR) plan and a Business Continuity Plan (BCP) when Daughter #1 splits up with boyfriend #4 doesn’t work. Fact. Though having a stash of Pringles handy and a forced re-watch of National Lampoon’s is a good plan to have.
But as with all good plans, a bit of planning before you have a plan is a good plan, and that’s what this article is about: the basic interactions between three aspects of these plans, and how you can enact them for your team or business.
Disaster is inevitable in some form or other. Who’d have thought we would face the disaster of medium-to-long-term limited office access (how’s the printing going?). And more typically, if you’re moving office, you can treat the lack of access as a short-term loss of access.
I mostly treat DR and BCP within the same structure, as they very much work together. Simply put, if your ICT team is experiencing a disaster, it’s likely the rest of the organisation isn’t getting on with business; no continuity, as it were.
And this is the lead-in to the first point: where does all this sit? I’ve found that DR sits largely in the ICT domain, simply because there needs to be structure in place to ensure a return to service in an orderly fashion, coupled with great communications. Business Continuity sits, logically, with each business unit, to ensure they’re able to keep working. Scenarios are really an outcome of the BCP process and serve to feed the DR planning.
It all looks a bit like this:
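In rough text form (my sketch of the flow just described):

Business Continuity Planning (each business unit)
    → identifies fail situations and mitigations
        → which become Scenarios (what can go wrong, and what to do about it)
            → which feed the Disaster Recovery plan (the ICT-led return to service)
                → which you practise, feeding lessons back into the BCP.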
Disaster Recovery
A trap for many is to put in place a DR plan for, say, flooding of the datacentre. That’s great, but take it up to the top level: what you actually have is a loss of services, likely critical ones, and that is what needs to be addressed.
Using a table like the one sketched below (the key is the arrows), you can work through with your ICT team what’ll happen and when. And you only need one of these tables to cover all your critical systems. I did one for critical and one for non-critical, and that was our DR largely done.
The way it works is to have your timeline on the rows, with whatever spacing works for you. I found that bands like 0-5 mins, 5-15 mins, first half hour, and within 1 hour worked for those involved and clearly articulated to the audience what is happening when.
The second axis is the who. I include monitoring, which really is what’ll tell you that you have an issue in the first place. From there it runs through the doing teams to senior leadership. You can of course include external parties in this model too.
Within the sheet itself is the great bit: you can cover what is happening when, and with whom. It’s likely your ICT team will want to investigate for some time before escalating and switching to an alternate backup solution; after all, the issue may lie in configuration or connectivity (both of which could carry the failure across to the backup).
And while the cost of switching solutions in a disaster scenario is one thing, the effort to switch back afterwards can be just as hard.
You end up with a relatively simple spreadsheet that outlines who’s doing what to recover, if not the outright ‘press this, push that’ commentary (that comes later).
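As a sketch, using the time bands above (the parties and entries here are invented for illustration; the arrows mark hand-offs and escalation):

Time            | Monitoring     | ICT team                                | Senior leadership
0-5 mins        | Alert raised → | Acknowledge and triage                  |
5-15 mins       |                | Investigate configuration/connectivity |
First half hour |                | Fix in place, or prepare to switch →    | Briefed, comms drafted
Within 1 hour   |                | Switch to the backup solution →         | Stakeholder comms sent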
Business Continuity Planning
When I was asked to do this some years back, I was concerned. I knew the ICT side of it, but that’s not the whole service chain, just one aspect of it.
And so I constructed a number of high-level fail situations (no site access, no database, etc.) and asked business experts to think about what they would do to continue working through each one.
Things I asked them to think through included:
- Impact on the business (Service/Financial/Anything else)
- What mitigating actions were appropriate
- Changes required to enable this and the target date to deliver (always have a date!)
The resulting spreadsheet was a simple matrix of fail situations against those answers.
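A cut-down sketch of the shape, with invented entries for illustration:

Fail situation          | Impact (service/financial)     | Mitigating action                   | Change required (target date!)
No site access, 1 day   | Calls unanswered; SLA risk     | Divert phones; work from home       | Softphone rollout
No database, half a day | Orders can’t be processed      | Take orders on paper, re-key later  | Printed form stock on hand
No email, over a day    | Customers can’t reach the team | Publish an alternate contact method | Standby Gmail account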
We went through everything from one staff member being unavailable for less than half a day through to the exec team disappearing for good, and from simple system issues through to full and final failure. All in a simple matrix.
For each business unit I tend to say allow 1-2 hours for the interview, and at the end of it you’ll have a picture of what BCP is already in place and what’s needed to ensure continuity in the event of a disaster.
A good example was a team that couldn’t live without email for any real length of time. So when email went down for over a day, the BCP was to communicate an alternate contact method (plausible in the systems we had). The action? Set up a Gmail account and have it on standby.
There’s a lot more to the structuring of this, but enough here, I think.
Scenarios
Once you have some high-level BCP situations covered and a DR plan in place, you can look at how to recover from specific situations faster.
Earlier I mentioned no access to the office. This is a BCP issue and should have mitigations in place, but it can be treated as a disaster, and probably should be; knowing what to do in these scenarios certainly helps recovery.
I have a template for these scenario documents.
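As a sketch, its sections run along these lines (indicative headings rather than a definitive list):

- Scenario description and trigger
- How it’s detected and declared
- Impact and affected services
- Recovery steps, owners and timings
- Communications: who’s told, when and by whom
- Return-to-normal criteria and switch-back steps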
All fine sections, I hope you agree. There’s a fair bit in each scenario document, but the effort here pays off when the "D" hits and you need to "R" as fast and as safely as possible.
To ensure you have the scenarios covered, build a list of what can go wrong and where: perhaps not ‘flooded datacentre’, but certainly ‘loss of key database server’ or similar.
And this is where the title of this article comes in: practise your disasters.
With the above three components in place, you can roleplay an outage, either by running through the paperwork or, as I did, by fully experiencing the journey with the MD…
We decided to roleplay a critical outage while releasing an update to a beta test server where essentially everything lined up with the live environment. The resulting journey through the DR plan and engagement with teams became all too real!
I strongly recommend this approach; it was equal parts scary and enlightening to have senior management get into the spirit of treating it like a full live event!
Completing the testing means you can refine your DR plan, improve your BCP and review your scenarios as you go; the cycle is complete.
Happy to discuss anything in this article as always. Drop me a line.
This article is my views based on experiences, training and observations of far too long in the technology arena. Your views, experiences and opinions are yours, valuable and equally pertinent. They’re just not mine and it’s easier to write about the stuff I know!