Which is more scary: a disaster or your disaster recovery plan?
Most organisations above a certain size have technology disaster recovery plans: plans for what they will do when something goes wrong, such as a fire, flood or power failure. These plans are often elaborate and expensive, involving redundant equipment and facilities in geographically separated locations. Organisations don’t run pairs of data centres because they enjoy running data centres: they do so against the day when one of the data centres isn’t there any more.
However, despite all of this preparation and expense, many organisations have two problems which mean that their disaster recovery plans may be no use in a real disaster. First, their disaster recovery plans aren’t really plans to recover from disasters. Second, their disaster recovery plans are just plans.
Disaster recovery plans that aren’t really plans to recover from disasters
Many years ago, I worked for a mid-sized organisation that couldn’t afford redundant data centre facilities. These days, we would have hosted everything on a cloud platform, but this was before such platforms existed. Instead, we had a recovery subscription service. This meant that, in the event of a disaster, we would turn up to someone else’s facility with many boxes of tapes, they would give us equipment to a defined specification, and we would attempt to restore our systems.
And, once a year, we rehearsed this process. It was usually fraught and difficult, and we encountered new challenges, but we always got there within our allotted time window. It was even a little fun.
Except that on the day a disaster actually happened, our plans didn’t work. A major incident threatened to disrupt power, transport and access to facilities in the part of the city where our offices were based. We quickly phoned our recovery service provider, and they told us that all of their other customers in the same area had done the same thing. Their facilities were already booked up, and the nearest place we could recover to was in another country. It would take days to ship our data and people to that site, and we would miss our recovery window.
Fortunately, on that occasion the disruption was less than we feared, and we were able to carry on operating in our data centre. But we had learnt an important lesson: our disaster recovery plan was not a plan to recover from disasters. Rather, it was a plan to execute a successful disaster recovery test - and a test in controlled circumstances.
Many disaster recovery plans are similar: they include tests which are supposed to prove that they work, but do no more than prove that the test can be executed successfully.
Disaster recovery plans that are just plans
However, this does not mean that we should not do tests. At another point in my career I was working for a much larger company, with several data centres, data replication and redundant equipment. Part of the design process for every system was to agree requirements for recovery, and to figure out which standard we should follow: hot/hot, hot/warm, hot/cold or any other combination of temperatures. Part of the production acceptance process was to prove that recovery for the system worked. With processes like that, it would be reasonable to expect that we had disaster recovery fully proven and under control.
Except that, while recovery had been proven for every system in isolation, it had not been proven on a larger scale, such as a complete data centre failure. There was a plan for such an event, but it was so complex that it had never been tested in full: the belief was that building the systems to test the plan was prohibitively expensive, while running the test against production systems was likely to cause catastrophic failure. The risk register for the organisation even recorded the decision that attempting to test the disaster recovery plan was higher risk than experiencing a disaster with an untested plan.
Perhaps that decision was right: the organisation has been fortunate enough not to experience a disaster of that nature, and perhaps it never will. But if that day comes, then it will be the same day that they find out whether their plan works.
How do we address these problems, and make sure that our disaster recovery plans work in practice as well as on paper? This is a difficult question: we are at the sharp end of engineering and risk management at scale. But I think that there are three things that we can do.
First, just like in real life, we can find out how our plans perform when we don’t know what’s coming. Rather than running rehearsals which we have prepared for and which have been signalled in advance, we can run drills and scenarios designed by a separate team and launched by surprise (to at least some people).
Second, we can shift our attention from recovery to resilience. As we move our workloads to cloud platforms, we bring options within reach that were previously out of our grasp: we can have geographically distributed, load balanced architectures without having to build whole new data centres. The best disaster recovery plan is one which requires no action.
And finally, we can be honest with ourselves. We can recognise when we are creating disaster recovery plans because our standards say that we must have disaster recovery plans, and when we are conducting rehearsals because our auditors will check whether we have conducted rehearsals. And then we can do the rather more interesting job of designing for the unexpected.
(Views in this article are my own.)
Open Group Certified Distinguished IT Architect | IBM Cloud Cross Portfolio Product Manager @ IBM
(1 year ago) Great article and interesting thoughts and observations, David Knott.
Mostly Retired
(1 year ago) Resilience theatre is a great phrase, David.
- securing the art of the possible
(1 year ago) I’m wondering what proportion of DR+BC plans end up just being beautifully written shelfware? I’ve been lucky to work at some great organisations where this was really taken seriously and the plans were physically enacted annually, not just as a desktop exercise. Guess what: it certainly highlighted deficiencies, which were then funded for remediation.
Cloud Consultant and Frugal Architect - 15x AWS certified, Solutions not Platforms. (opinions are my own).
(1 year ago) I’ve weathered my share of disasters and recoveries and I can certainly say none of the disasters went as planned.