Master of Disaster (Recovery)
I have just come off a successfully delivered Disaster Recovery (DR) Programme which, inevitably, had an exceptionally poor starting point, so the programme’s timelines were always up against it. The normal scenario for a lot of businesses is that the information needed is incorrect, incomplete or unavailable, often with politically correct words used to hide the fact that it doesn’t exist. Perhaps this is a bit too cynical, but in my experience it is so. Before the preparation for DR can be started in anger, some of the information required for DR includes:
- Base IT and Application data (up to date) held in a CMDB
- Upstream and downstream applications identified and documented
- Service management documents
- ITIL processes in place and relevant, at least the basics
- Third parties identified along with their contracts. Often third-party contracts are silent about DR or DR Testing, and no SLAs exist during a DR or, worse still, DR is not covered at all
- Software Licenses are always a problem for a DR environment and normally even worse for a DR Test.
Licenses can really be split into 3 areas:
- Those not required
- Those that are required for functional reasons
- Those required for compliance and legal reasons
For example, some licenses require a hardware serial number in order to enable the software. In a DR environment on a physical machine, that serial number will be different. If this is not dealt with or tested prior to the event, it will likely mean lots of last-minute calls and favours called in. Just hope the software company is still in business, has not been sold, and that you have the customer ID so they will talk to you.
Some of the issues to overcome with DR are High-Level and Low-Level designs that are missing or out of date, and Business Impact Assessments based on politics or historical factors more than need, which skews the assessment. Organisational culture is often poor and aggressive, and there is a lack of team ethos. In fact, almost every gap you can imagine will occur, starting with documentation that is missing, incorrect or out of date. So, we have the proverbial mountain to climb. How do we do DR when some senior bod has given a commitment that DR, or some form of test, will be ready by a ridiculously fast date, while treating it as a second or third level priority? The answer is to change the priority of the senior bods, either to give appropriate resources or to change what will be tested.
The key principles of Disaster Recovery are frameworks, repeatability and ensuring senior executive sponsorship. I am saying nothing new or really special here. This article will focus on what comes after the Business Impact Assessment, hence after grouping your business functions, business services and IT Services. Business Services and IT Services can be really confusing as they can have a many-to-many relationship (back to Service Now and definitions). Some simple examples of business services are producing, selling, supporting, or even different groups in a business.
An IT Service is the layer below the Business Service. However, the same IT Service can be used in multiple business services. For example, a reporting tool may be used by many different business services, but different business services may use different functionality of the reporting tool. Hence an IT Service may be broken down further, to a more granular level, into restore points. For example, if the Recovery Time Objectives in use are 15 minutes, 12 hours, 24 hours, 36 hours and 72 hours, then different parts of the functionality may be required at each point, as the company will determine via the Business Impact Assessment what it can do without, where it has workarounds and what are must-haves.
Recovery Time Objective (RTO) – how much time you have to restore the service
Recovery Point Objective (RPO) – how much data you can lose
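As a rough illustration of how the two objectives are measured, the sketch below checks a single recovery against its targets. The class name, field names and timestamps are hypothetical and not taken from any real programme. It assumes RTO is measured from the outage to the point the service is proven restored, and RPO from the last recoverable copy of the data to the outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryRecord:
    """One service's recovery, with hypothetical field names."""
    service: str
    rto: timedelta             # agreed maximum time to restore the service
    rpo: timedelta             # agreed maximum window of data loss
    outage_start: datetime     # when the service went down
    last_good_copy: datetime   # last backup or replica that can be restored
    restored_at: datetime      # when the service was proven restored

    def downtime(self) -> timedelta:
        return self.restored_at - self.outage_start

    def data_loss_window(self) -> timedelta:
        return self.outage_start - self.last_good_copy

    def met_rto(self) -> bool:
        return self.downtime() <= self.rto

    def met_rpo(self) -> bool:
        return self.data_loss_window() <= self.rpo


# Illustrative figures only.
record = RecoveryRecord(
    service="reporting",
    rto=timedelta(hours=12),
    rpo=timedelta(minutes=15),
    outage_start=datetime(2024, 1, 10, 9, 0),
    last_good_copy=datetime(2024, 1, 10, 8, 50),
    restored_at=datetime(2024, 1, 10, 22, 30),
)
print("RTO met:", record.met_rto())
print("RPO met:", record.met_rpo())
```

In this made-up case the data loss is within the RPO but the restore overruns the RTO, which is exactly the kind of result that needs to be captured for the post-recovery questions.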
One approach to planning for a DR event or test splits into 5 main work areas, depending on the organisation’s base point:
Remediation – putting in place what should have been in place. Try restoring systems that don’t work in the first place. I have seen so many examples, such as where the server was the wrong type and had had parts borrowed years ago that were never returned. Here are a few common ones:
- Backup tapes or media that don’t work
- Insufficient storage
- Who is the software vendor!!
- Licenses being out of date and the company has been sold
- Software code not working, out of date or hardcoded
- Hard disk errors
- Patches missing
- Different versions of hardware or software from Live to DR
- Hardware or Software out of support or media lost etc etc.
The key point is to get the Team (Vendor and In-house) to be open and honest as early as possible. Often documentation is incomplete or missing. Hence key IT Systems are not known about and are only found out about too late in the day. The closer you get to a DR Event or DR Test, the harder it is to sort. I have even experienced an Architect say that if something comes up in the test, “we will say it was unforeseen, wasn’t it, and it can be corrected later”. This was due to politics in the firm, and in no uncertain terms I explained that it was not acceptable for me to fail as a PM when there was time to correct the issue. The issue was corrected.
Design of the Disaster Recovery environment. There are often 2 sets of designs, one for actual DR and one for Testing. Often the Test cannot affect the live environments, and therefore complications in the designs occur, with terms such as bubble, duplicate IPs and limited testing coming up. The normal scenario is that high-level and low-level designs will be required, along with application mapping and lots of capacity planning. Some key points to try to enforce are to build like for like, with the same functionality. Also think about roll back from the Test/DR site to Normal Operations.
Build Out is the technical delivery, making sure it is all in place. This can range across Wintel, Unix, Storage, Backup, Tooling and Security, Networks, load balancing, firewalls, remote access etc etc. Just bear in mind that, just because it is a DR environment, it will not be faster to build than your normal IT build speed unless something different occurs.
Service Delivery is split into the various components and is a huge area.
BAU Operating Model. Finding out who supports what, and agreeing on the name for it, can be an absolute nightmare. Even in-house, different teams can refer to a system by different names or incorporate different functionality into those names.
DR Strategy and Plan, detailing the what, the why and the how of the organisation’s DR function. Often Business Continuity is linked heavily here.
IT Technical Recovery Plans (TRP) - These are often referred to as TRPs and are based on the operating model and the Business Services in scope. When looking at restoring a service, the core principles are almost a tower in view: Core Generic functionality (storage, networks, backup, foundation services such as AD, Anti-Virus, Jump Servers, Proxy etc etc.), OS, Databases, Middleware, Tools, Security, Clustering. Many people think this will restore a system. It will not; there are whole layers above it. A TRP is a basic document on how you restore the Infrastructure and prove it is restored, possibly with some workarounds. Lots of templates exist, mostly way too complex. Simply put, the team need to know how to restore the product and what to do if things go wrong, such as who to tell and when. Too much waffle and teams can end up debating during a DR Test / Event, and that kills the RTO (Recovery Time Objective). The old adage of keep it simple is really good here. The key point is that some form of testing at this level needs to occur during the event to prove that the Technology works. When recovering, there are principles for the order of restoration in terms of prioritisation. A guide is that, at the same level of priority, servers are restored in the following order: tools, databases, web/applications.
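The ordering guide at the end of the paragraph above can be sketched as a two-part sort: business priority first, then server type within the same priority. The server names, the priority numbers (1 being the most urgent) and the tier labels are made up for illustration.

```python
# Order within the same priority, as given in the text:
# tools, then databases, then web/applications.
TIER_ORDER = {"tools": 0, "database": 1, "web_app": 2}

# Hypothetical servers: (name, priority where 1 is most urgent, tier).
servers = [
    ("web-01", 1, "web_app"),
    ("sql-02", 2, "database"),
    ("sql-01", 1, "database"),
    ("monitor-01", 1, "tools"),
    ("backup-console", 2, "tools"),
]


def restore_order(items):
    """Sort by priority first, then by tier within the same priority."""
    return sorted(items, key=lambda s: (s[1], TIER_ORDER[s[2]]))


for name, priority, tier in restore_order(servers):
    print(priority, tier, name)
```

Keeping the rule in one small lookup table means the team is not debating the order during the event; if the guide changes, only `TIER_ORDER` changes.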
Service Recovery Documents (SRD) – These can be broken into 2 layers depending on complexity, either at the IT Service or the Business Service Layer. In principle, once a subset of the Infrastructure is available, the IT Services can be restored. For example, it could mean that the Core Infrastructure is available, an OS is available and a Database such as SQL is restored. This is typically done either by a TRP or a hot environment (always on and available). The application can then be restored. You may need a group of IT Services to be able to restore a business service. In principle the SRDs should be independent and therefore not need to be restored in a specific order, as operational validation later on would integrate them into a business service. However, pragmatism does mean that sometimes this cannot occur, as functionality is dependent on other functionality. What’s more, an IT Service can be used by multiple business services.
Note that even after an SRD is run, Operational Validation is required, which can include batches and Middleware data. Sometimes, for example for mainframes, the batch jobs need to be identified and, depending on when the service stopped, will need addressing. If a partial batch or dataflow has occurred, then decisions are needed on what has to happen (a rerun, a repair, or nothing) and on whether the business is prepared to accept those decisions.
Licenses – quite often this is a sore point in a lot of firms. They may have licenses that are out of date, or software vendors may not realise the firm has a license. The licenses may not cover DR or DR Tests, or the firm may not even know if it uses them. There is the politics: if you go to a supplier and ask “am I legal for a DR or a DR Test?”, they will want to charge a lot of money. Hence there is often a large activity to identify the licenses used on an estate, find out who the vendors are, identify any license agreements and confirm whether those agreements say anything about DR or use during a test. Worse, if the firm changes name or changes the third parties who support the software, then the Vendor may say the License is invalid. There are ways and means to contact Vendors to confirm, and often upgrades are required in Production as well as DR. Furthermore, before going down this route it is advisable to check that the license is required by confirming its usage: whether it has been used in the last day, week, month, quarter, half year or year. There is little point in restoring software if it is not used. In one case I checked the mainframe licenses used and, whilst there were over 7000 instances of licenses, only 350 were used. Hence the Programme identified there was an opportunity to reduce the complexity on the estate.
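The usage check described above can be sketched as a simple filter over last-used dates. The inventory records, product names and the one-year look-back window are assumptions for illustration; in practice the last-used dates would have to come from whatever metering or audit information the estate actually has.

```python
from datetime import date, timedelta

# Hypothetical inventory: product name -> date last used (None if never seen in use).
inventory = {
    "report-writer": date(2024, 5, 30),
    "legacy-sorter": date(2021, 2, 1),
    "batch-scheduler": date(2024, 6, 1),
    "old-compiler": None,
}


def split_by_usage(last_used, as_of, window):
    """Return (used, unused) product names, judged over the look-back window."""
    cutoff = as_of - window
    used, unused = [], []
    for product, when in sorted(last_used.items()):
        if when is not None and when >= cutoff:
            used.append(product)
        else:
            unused.append(product)
    return used, unused


used, unused = split_by_usage(
    inventory, as_of=date(2024, 6, 30), window=timedelta(days=365)
)
print("Confirm DR cover for:", used)
print("Candidates to drop from DR scope:", unused)
```

Running the same check with shorter windows (a week, a month, a quarter) gives a feel for which products are genuinely in regular use before any conversation with a vendor starts.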
Third Parties are another huge area and can take time to cover off. Often third-party agreements do not say anything about providing higher level services during a DR Event or Test. The agreements may be location specific or may not cover the full functionality. Third Parties may even need to have their own DR Environment if they are providing services, for example Cloud based services. Key points to consider are that commercial arrangements will take time, especially contractual negotiations, and that timing is often important where Third Parties are on time-based contracts. They are often more agreeable towards the end of a contract, as they want the renewal. However, it is important to understand where the third parties fit in and potentially grade them, for example as key and required during a DR Event or Test, or as potentially required for support. The cost of additional support often comes back to the appetite for risk the firm wants to adopt.
How the DR will be managed during the event is an often over-talked and over-hyped scenario. Lots of Gold, Silver, Bronze teams and discussions will take place. During a DR Event there are likely to be printouts and tick sheets, and some will even have coffee and food available, with war rooms and little used posters around the room. Printouts of plans and reports are useful. However, resources need to be available to populate them during an event. Key things during a DR Event will be:
- Meetings with management (time-boxed, with resources)
- Communications of success, failure and current position
- A predefined status report, every 2 to 3 hours for Management consumption
- Clear success and reporting criteria agreed with management prior to the event
- Excel and/or Microsoft Project (MSP) plan
- Issue Tracker and status
- Lessons Learnt Tracker
- Breakout rooms for addressing issues
- Clear leadership and ownership of resolutions
What is key is that senior managers do not get in the way and hinder progress by having long discussions with lots of resources who could be progressing the plan. One of the key risks affecting a DR Event or Test is management requiring information that was not planned to be given. When creating the Excel or MSP plan, identifying recovery groups and dependencies is critical, as is the order of restore in terms of timelines. Finally, how work is initiated during an event and the style and type of communications are really important, as is updating and documenting the plan as you go along. It is guaranteed that post recovery someone will ask “did you recover it in time?”, and therefore this needs to be captured.
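One way to sanity-check the recovery groups and dependencies before they go into the Excel or MSP plan is to derive the restore order from them. The sketch below uses Python’s standard `graphlib` module (Python 3.9 or later); the group names and their dependencies are invented for illustration and will differ in every organisation.

```python
from graphlib import TopologicalSorter

# Hypothetical recovery groups mapped to the groups they depend on.
dependencies = {
    "core-infrastructure": set(),
    "operating-systems": {"core-infrastructure"},
    "databases": {"operating-systems"},
    "middleware": {"operating-systems"},
    "applications": {"databases", "middleware"},
    "operational-validation": {"applications"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular

# Each wave holds groups whose dependencies are already restored,
# so the groups within a wave can be worked on in parallel.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

A circular dependency surfacing here, rather than halfway through the event, is the cheaper place to find it.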
Testing is another area within DR that is not often thoroughly considered. Each area should have Testing: Hot (always on), Warm and Cold infrastructure, Applications, Integration, Operational Validation, and Business Testing with sign off that it is usable and acceptable. One of the problems with testing is the Goldilocks effect: not too much and not too little, along with who can test it.
There is one final thought when looking at a DR Environment: pragmatism. However good the planning is, things will not work out as planned. Make sure as a PM that someone is documenting timelines and lessons learnt, good or bad, as you are going along. Plan it in. If you don’t, I guarantee you will snatch failure from the jaws of victory, as after the event people’s memories change or they forget.
Jason Douglas is an experienced Programme / Project Manager
If you would like to connect on LinkedIn please do so
LinkedIn https://www.dhirubhai.net/in/jason-douglas-2376832/
Other articles written by myself are
Vendor Management for Cloud - Customer Fights Back
Third Eye of Project Management
Stress Above the Rainbow - Northern Style
In Flight Programme Handover- Northern Style 101