The Structure of Disaster Recovery
The structure of a Disaster Recovery Plan
This article describes an effective template from which a working Disaster Recovery Plan (DRP) can be developed. It details the sections and content of the DRP and the sources of the information required therein. The article also examines the disciplines needed to keep the DRP active and up to date. The template complies with the corresponding Business Continuity standard ISO22301 but is not restricted by or to it.
Step One of Five: Ground Rules
Start by setting the ground rules, such as:
1. The DRP must:
a. respect the business’s Security Policy
b. be sponsored by senior management (CEO etc.)
c. involve all parts of the business
d. manage the disaster event from detection to resolution
e. name and describe all roles in the DRP
f. name the occupier of each role, their deputy and contacts
g. reflect the actual business recovery needs and priorities
h. link to the business Back-up & Restore strategy
i. link to any relevant 3rd Party DRP
j. meet all relevant contractual obligations
k. be subject to frequent review and test
l. respond to an agreed range of disaster scenarios
m. be kept up to date via the Change Management process
n. be concise: the DR, the whole DR and nothing but the DR
2. The DRP must not:
a. contradict ISO22301
i. This article refers to that ISO standard only as a guide, not as a rule requiring formal Certification.
b. just be put on a shelf to await the next audit
c. rely on complex bespoke DR planning software
i. Such software may be useful, but reliance can result in an unnecessary addition to the complexity of the Recovery Plan. Cost v Benefit analysis is required here.
d. do harm to services when it is executed in Test mode
Step Two of Five: Sources of a DRP
Define the necessity for a DRP in terms of the actual risks to your organisation of the loss of business continuity. That loss means that the organisation has ceased to function. Develop that definition in the context of your organisation’s Security Policy and the sources you will use to develop the DRP. Here is a summary of the way that definition should be built from its sources:
1. Company Security Policy
a. The Policy defines the mandatory requirements for operational behaviour in terms of the use of, or reference to, the business systems and the data they contain. It may be a disciplinary offence to ignore or violate the requirements even if the transgression resulted from ignorance of the Policy. The use of systems and accesses to data will be monitored for unexpected or suspicious activity.
b. The requirements cover such aspects as:
i. Control of access to all systems and data. That control includes maintaining up to date security profiles for all individuals needing access.
ii. Password composition, protection and update rules
iii. Continual compliance with all relevant legislation and regulation, such as that related to Data Protection
iv. System, data and privileged information to be used only for agreed and legitimate purposes.
c. For the DRP this means that the DRP must build an operable version of the normal production system that complies with all aspects of the Security Policy. There is a difficulty here in that the test version of the DRP needs to be loaded with some level of data in order to provide a realistic recovery test of the systems. It may be, from a Data Protection viewpoint, that live data must not be used for testing purposes. To resolve that, the Company must either obtain documented permission from the Data Owner to use live data for testing, or use a redacted version of the data with all personal Data Subject information removed.
2. Risk Assessment Report
a. The Risk Assessment defines the likelihood that external events will compromise business continuity. These events do not include the commercial risk of business failure due to such difficulties as poor product selection, lack of funds or hostile takeover.
b. The definition covers such possible threats as:
i. Fire
ii. Flood, weather storm, earthquake
iii. Loss of national power
iv. Civil unrest
v. External infrastructure failure including Cloud services
vi. Pandemic
vii. Cyber attack
c. For the DRP this means that each possible threat becomes a potential disaster scenario against which to test the DRP
3. Business Impact Analysis
a. Business Impact Analysis defines the cost to the organisation of the loss of each internal business process
b. The definition covers for each process the impact of service loss:
i. Financial – the increasing loss over time caused by the service interruption
ii. Reputational – the loss of prestige and trust in the eyes of the customer base, competitors and the Media.
iii. Legal – the lack of delivery to the customer base may have contractual implications. There may be implications too of prolonged service outage leading to an inability to pay suppliers or even staff.
iv. Data integrity – prolonged outage may also cause stored data to become out of date and thus increasingly inaccurate if not obsolete.
c. The definition also covers the recovery objectives
i. Recovery Time Objective – the amount of outage time the business will tolerate before recovery is completed.
ii. Recovery Point Objective – the amount of data, in processing time, that the business will tolerate losing as a result of the recovery process.
d. For the DRP this means that the BIA can be used to sequence the recovery, so that the most vital processes get recovered first.
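As a minimal sketch of point 3d, the recovery sequence can be derived by sorting the BIA entries on their Recovery Time Objective. The `BiaEntry` type, the process names and the figures below are hypothetical illustrations and not taken from any real BIA. Breaking ties on the tighter Recovery Point Objective is an assumption made for this sketch; a real BIA may also weigh financial, reputational and legal impact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiaEntry:
    """One business process from a hypothetical Business Impact Analysis."""
    process: str
    rto_hours: float  # Recovery Time Objective: tolerated outage time
    rpo_hours: float  # Recovery Point Objective: tolerated data loss, in processing time


def recovery_sequence(entries):
    """Order processes so the least outage-tolerant one is recovered first.

    Ties on RTO are broken by the tighter RPO (an assumption of this sketch).
    """
    return sorted(entries, key=lambda e: (e.rto_hours, e.rpo_hours))


# Illustrative figures only.
bia = [
    BiaEntry("Payroll", rto_hours=72, rpo_hours=24),
    BiaEntry("Customer orders", rto_hours=4, rpo_hours=1),
    BiaEntry("Internal wiki", rto_hours=168, rpo_hours=48),
]

for position, entry in enumerate(recovery_sequence(bia), start=1):
    print(position, entry.process, f"RTO {entry.rto_hours}h")
```

The resulting order is only a starting point for discussion with the business; technical dependencies between systems may force a different sequence.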
Step Three of Five: Building a DRP
In the life cycle of the DRP there are three distinct phases:
1. Building the DRP ... a one-off activity
2. Using the DRP ... a regular test activity, except in a real Disaster
3. Maintaining the DRP ... a continual activity
This step covers the Building of the DRP. For DRP support an infrastructure of teams is required. In the DRP the members of each team and their deputies are identified together with their roles, skills and full contact details. These teams, explicitly identified in the DRP, are:
· Damage Assessment Team (DAT)
o Tasked with assessing the actual and potential impact of the threat. If the DAT declare a Disaster then the CMT is activated. Otherwise normal problem management procedures are used.
· Crisis Management Team (CMT)
o Tasked with running the overall recovery process and controlling status reporting both internal (the Business) and external (customers, suppliers, Media)
· Disaster Recovery Team (DRT)
o Tasked with running the detailed recovery process. Typically this is a combination of technical and business staff. The DRT takes instruction from, and reports to, only the CMT.
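The hand-offs between these teams can be sketched as a small decision function. This is an illustration only: it assumes that problem management sees the event first, as the detection section of this step describes, and the `Handler` and `teams_engaged` names and the two boolean inputs are hypothetical.

```python
from enum import Enum


class Handler(Enum):
    PROBLEM_MANAGEMENT = "normal problem management"
    DAT = "Damage Assessment Team"
    CMT = "Crisis Management Team"
    DRT = "Disaster Recovery Team"


def teams_engaged(beyond_normal_management, disaster_declared):
    """Return the handlers engaged for one threat event, in order.

    beyond_normal_management: problem management's decision (hypothetical input).
    disaster_declared: the DAT's decision (hypothetical input).
    """
    engaged = [Handler.PROBLEM_MANAGEMENT]
    if not beyond_normal_management:
        return engaged  # handled by normal problem management procedures
    engaged.append(Handler.DAT)
    if disaster_declared:
        # Only a DAT declaration activates the CMT, which then directs the DRT.
        engaged.append(Handler.CMT)
        engaged.append(Handler.DRT)
    return engaged


for flags in [(False, False), (True, False), (True, True)]:
    print(flags, [h.value for h in teams_engaged(*flags)])
```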
The Build is a direct consequence of the function of the DRP which may be summarised as a sequence of activity as follows:
1. Disaster Event detection and declaration
a. Who is involved ?
i. Initially whoever discovers the potential threat whether external (see Risk Assessment) or internal (see BIA)
ii. Secondly problem management.
iii. Finally, if the issue is beyond standard problem management then the DAT become involved.
b. What do they need to do ?
i. The discoverer and their management need to promptly report the threat event as clearly as possible to problem management and to the emergency services as appropriate.
ii. Problem management needs to rapidly assess the event report and decide whether it’s beyond normal management. If so then the DAT is alerted.
iii. The DAT need to rapidly but carefully assess the overall situation taking internal and external advice as required.
iv. If the DAT declare a disaster they need to activate the CMT.
c. When do they need to do it ?
i. The Discoverer needs to act immediately
ii. Problem management needs to act as soon as the event is reported.
iii. The DAT needs to act as soon as alerted by problem management
2. DRP Communications
a. Who is involved ?
i. Once Disaster has been declared by the DAT and the DAT has activated the CMT, all communications are made and controlled by the CMT
b. What does the CMT need to do ?
i. Activate the DRP to start the recovery process.
ii. Establish an operating bridge as the single point of contact for all activity and status during the recovery process.
c. When do they need to do it ?
i. The CMT needs to act as soon as a Disaster is declared by the DAT.
3. Technical recovery procedure (fail-over). This procedure varies according to the nature of the disaster and the IT recovery architecture of the organisation. As noted above, the test disaster recovery scenarios are derived from the Risk Assessment report, and the recovery sequence from the BIA. It may be that part or the whole of the Technical DR process is handled by a 3rd party supplier. In that case the CMT liaises directly with that 3rd party. Where that is not the case, the Failover-Failback activity (a-b-c below and the Technical fail-back procedure) remains partly or wholly relevant.
a. Who is involved ? (many, so listed here are skills not people)
i. DR Architecture
ii. Facilities Management
iii. Hardware and Software buy in
iv. Hardware build
v. System software licensing and build
vi. Back-up and Restore
vii. Network connectivity and resilience
viii. System access security
ix. Business application support
x. Database administration
b. What do they need to do ?
i. Deploy those skills as and when required by the DRP
ii. Document a recovery procedure out of the work they need to do as part of DR:
1. What needs to be in place before they start
2. Exactly what they do and how long it will take
3. How do they know it has worked
4. What else in the DRP can then be done
iii. As you can see, when b(ii) 1-4 above is complete for all required skills, a draft of the DRP Technical recovery section is complete. Remember that in the DRP, people must be related to the skills with their full contact details and deputies.
iv. Define the accesses and authorisation needed to run the technical recovery. The DRP may contain the actual User Ids and passwords required but it is suggested that the DRP simply points to a secure procedure that defines the access. This caters for the possibility that key recovery staff are unavailable as a result of the disaster event.
c. When do they need to do it ?
i. As part of the build of the DRP
4. Business recovery procedure. This procedure is a constant across all threats and architectures. It involves assessing the data integrity and operability of the recovered systems.
a. Who is involved ?
i. Business users of the systems
b. What do they need to do ?
i. Create test cases designed to assess the quality of the recovered systems
ii. Document those cases as part of the DRP
iii. Define the accesses and authorisation needed to run the business verification. The DRP may contain the actual User Ids and passwords required but it is suggested that the DRP simply points to a secure procedure that defines the access. This caters for the possibility that key business staff are unavailable as a result of the disaster event.
c. When do they need to do it ?
i. As part of the DRP build
5. Technical fail-back procedure. This procedure is a mirror of the fail-over procedure, returning the systems to their original place. It is only required if the organisation is returning to its original site or service supplier. To verify that return, if it happens, the Business recovery procedure is re-run.
a. Who is involved ?
i. Technical and Business recovery staff
b. What do they need to do ?
i. Prepare the input to the DRP based on the technical fail-back and Business recovery procedure
c. When do they need to do it ?
i. As part of the DRP Build
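The four questions asked of each skill in the Technical recovery procedure (what must be in place, what is done and for how long, how success is known, what can be done next) map naturally onto a small record per recovery step. The sketch below is one possible representation, not a prescribed DRP format: the `RecoveryStep` type, the step names, durations and checks are all hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to derive a run order from the prerequisites.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class RecoveryStep:
    name: str
    skill: str
    prerequisites: tuple   # 1. what needs to be in place before starting
    duration_minutes: int  # 2. how long the work is expected to take
    verification: str      # 3. how the team knows it has worked


# Hypothetical steps, deliberately listed out of order.
steps = [
    RecoveryStep("Restore database", "Back-up and Restore",
                 ("Build servers",), 90, "Row counts match the back-up manifest"),
    RecoveryStep("Build servers", "Hardware build",
                 (), 120, "Servers respond on the network"),
    RecoveryStep("Start application", "Business application support",
                 ("Restore database",), 30, "Log-in page loads"),
]


def run_order(steps):
    """Return step names so that every prerequisite comes before its dependants."""
    graph = {s.name: set(s.prerequisites) for s in steps}
    return list(TopologicalSorter(graph).static_order())


def unlocked_by(steps, finished):
    """4. What else in the DRP can be done once `finished` is complete."""
    return [s.name for s in steps if finished in s.prerequisites]


print(run_order(steps))
print(unlocked_by(steps, "Build servers"))
print(sum(s.duration_minutes for s in steps), "minutes if run one after another")
```

`TopologicalSorter` raises `graphlib.CycleError` if two steps depend on each other, which is itself a useful desk check on a draft procedure.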
Each of the above 1-5 appears as a section in the DRP. It is recommended that a redacted version of an existing DRP is used as a model. These are widely available on the web, and using one avoids the problem of starting with a blank page. A search for “ISO22301” provides a good guide.
Let’s move on.
Step Four of Five: Using a DRP
There are two distinct versions of DRP use:
1. DRP Test
a. The prime objective is to find fault
i. Fault in the content of the DRP – information such as roles, contact details, technical procedures are shown to be incomplete, inaccurate or obsolete
ii. Fault in recovery team understanding of the recovery process and/or the importance of that process
b. The secondary objective is to correct those faults as they are encountered and then re-test to prove the correction
2. Live DRP – there has been an actual declared disaster event
a. The prime objective is the recovery of the systems as cleanly and as quickly as possible, meeting the Recovery Time and Recovery Point Objectives.
b. The secondary objective is again to trap DRP faults but then just get round them and formally update the DRP later
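Whether a test or live recovery met its objectives is simple arithmetic on three timestamps. The sketch below assumes that outage is measured from the disaster event to completed recovery, and that data loss is measured from the last recoverable data point to the disaster event; the function name and the example times are hypothetical.

```python
from datetime import datetime, timedelta


def objectives_met(disaster_at, recovered_at, last_good_data_at, rto, rpo):
    """Compare one recovery run against its RTO and RPO.

    Assumptions of this sketch: outage = recovered_at - disaster_at,
    data loss = disaster_at - last_good_data_at.
    """
    outage = recovered_at - disaster_at
    data_loss = disaster_at - last_good_data_at
    return {
        "outage": outage,
        "data_loss": data_loss,
        "rto_met": outage <= rto,
        "rpo_met": data_loss <= rpo,
    }


# Illustrative times only.
result = objectives_met(
    disaster_at=datetime(2021, 1, 11, 9, 0),
    recovered_at=datetime(2021, 1, 11, 12, 30),
    last_good_data_at=datetime(2021, 1, 11, 8, 15),
    rto=timedelta(hours=4),
    rpo=timedelta(minutes=30),
)
print(result)
```

In this example the outage is within the RTO but the data loss exceeds the RPO, which is the kind of fault a DRP test is meant to expose.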
In test mode the DRP is subject to two kinds of verification
1. Desk Top Check
· All the DRP documentation is reviewed by the various subject matter experts, all persons identified within the Plan and Business Continuity staff
i. Is the documentation correct and complete ?
ii. Will it achieve the Recovery Objectives ?
· Update the documentation based on the review and enter a review-update-review cycle until there is a general and high level of confidence in the viability of the DRP
2. DR simulation
· The DRP is used directly to build and run a recovered version of the “failed” systems. This will exercise and verify the Communications, Technical and Business elements of the DRP (see Steps 2 and 3 above)
· Update the documentation again based on the simulation results and enter another review-update-review cycle until again there is a general and high level of confidence in the viability of the DRP
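Part of a Desk Top Check can be automated. The sketch below looks for incomplete role entries, on the basis that the DRP must name the occupier of each role, their deputy and contacts. The `RoleEntry` type and the sample entries are hypothetical, and the rule that a deputy must be a different person from the occupier is an added assumption, not something the template states.

```python
from dataclasses import dataclass


@dataclass
class RoleEntry:
    role: str
    occupier: str
    deputy: str
    contact: str


def desk_check(entries):
    """Return a list of human-readable faults; an empty list means none found."""
    faults = []
    for entry in entries:
        for field_name in ("occupier", "deputy", "contact"):
            if not getattr(entry, field_name).strip():
                faults.append(f"{entry.role}: missing {field_name}")
        # Assumption of this sketch: a deputy must be someone else.
        if entry.occupier.strip() and entry.occupier.strip() == entry.deputy.strip():
            faults.append(f"{entry.role}: deputy is the same person as the occupier")
    return faults


# Hypothetical entries.
roles = [
    RoleEntry("CMT lead", "A. Example", "B. Example", "extension 0001"),
    RoleEntry("DRT lead", "C. Example", "", ""),
    RoleEntry("DAT lead", "D. Example", "D. Example", "extension 0002"),
]

for fault in desk_check(roles):
    print(fault)
```

A check like this finds only missing information; whether a listed contact detail is still accurate can only be confirmed by the people named in the Plan.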
Step Five of Five: Maintaining a DRP
Maintenance of the DRP is driven by five requirements
1. The results of Desk Top Check, DR simulation run or live recovery.
2. Audit requirements and (optional) compliance with ISO22301.
3. Changes in the Business that cause updates to the RTO or RPO of existing applications or the addition of new applications or the removal of obsolete applications.
4. Changes in the technologies used by the resilience and recovery processes.
5. Changes in the use of 3rd Party DR service suppliers as part of the DRP.
Given that the DR capability is a vital part of the overall production system, its maintenance must be controlled by formal Change Control Management. Each time a change is requested, that change should be reviewed for any DR implications. Likewise, if a change to the DRP is requested, its implications for the rest of production should be considered.
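One simple way to make that review routine is to keep a list of the components the DRP depends on and compare each change request against it. The sketch below is illustrative only: the component names and the `dr_implications` function are hypothetical, and a real Change Management tool would hold this information in its own way.

```python
# Hypothetical list of components that the DRP relies on.
DR_RELEVANT = {
    "back-up schedule",
    "database server",
    "network links",
    "3rd party DR contract",
}


def dr_implications(changed_components):
    """Return the changed components that the DRP depends on, sorted by name."""
    return sorted(set(changed_components) & DR_RELEVANT)


change_request = ["database server", "office printers", "back-up schedule"]
affected = dr_implications(change_request)
if affected:
    print("Review the DRP before approval:", ", ".join(affected))
else:
    print("No DR implications found in the listed components")
```

A match does not block the change; it only prompts a review of the DRP alongside the change.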
I hope you found this interesting and useful.
Do respond as I’m trying to make sure that I have my answers questioned.
Comments welcome on [email protected]
Roger Jarvis MBCI, Fulham, London. January 2021