Disaster Recovery Approach and Solution
Mathangi Shankar
Chief Architect FS, India | TOGAF Enterprise Architect | Art lover | Career Mentor, Senior Digital Solution Architect, Portfolio Manager at Capgemini
1) What is Disaster Recovery? Why do we need Disaster Recovery?
Disaster Recovery (DR) is planned for a disaster affecting the production IT systems. Availability is usually one of the non-functional requirements the business asks for, so we set up a Disaster Recovery site that is a near replica of the production system or application. When a production outage occurs, production is switched to the DR site. The Disaster Recovery solution and plan must be prepared and approved by all stakeholders well in advance, so that end users are not affected by application downtime. Always keep the Disaster Recovery site isolated from the production systems. The plan should also record whether the sites will run Active-Active or Active-Passive. A Disaster Recovery solution has to be in place so that the business does not suffer.
2) Disaster Recovery vs. Highly Available solution:
Disaster Recovery and a Highly Available solution are often confused, but the two terms are different. A Highly Available solution keeps your IT system available and handles failovers. It is generally addressed through distributed computing and regional deployments: how do you make your solution highly available in a particular region? You go with solutions such as multiple nodes in a cluster, so that even if one node fails, another node in the cluster can pick up the request and perform the operation. Disaster Recovery is different: it is planned for an outage of the complete region, where another site must be made available so that users are minimally impacted or not impacted at all.
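The difference can be sketched in a few lines of Python. This is only an illustration under assumed names: the `Node` class, the region labels and the `pick_node` function are hypothetical and not part of any product. High availability is the first loop (another node in the same region takes over); disaster recovery is the second loop (the whole region is gone, so the DR site takes over).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    region: str
    healthy: bool


def pick_node(nodes: List[Node], primary_region: str, dr_region: str) -> Optional[Node]:
    """Return a healthy node, preferring the primary region."""
    # High availability: any healthy node in the primary region can serve.
    for node in nodes:
        if node.region == primary_region and node.healthy:
            return node
    # Disaster recovery: no node in the primary region is left, use the DR site.
    for node in nodes:
        if node.region == dr_region and node.healthy:
            return node
    return None


# Hypothetical cluster: one failed node in the primary region, one DR node.
nodes = [
    Node("app-1", "region-a", healthy=False),
    Node("app-2", "region-a", healthy=True),
    Node("app-dr-1", "region-b", healthy=True),
]
print(pick_node(nodes, "region-a", "region-b").name)  # app-2
```

If `app-2` is also marked unhealthy, the same call returns `app-dr-1`, which is the regional outage case that Disaster Recovery is planned for.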
3) Which are the basic elements you would consider in a Disaster Recovery Plan?
3.1) Make sure your solution has a disaster recovery mechanism in place, whether automated or manual. If the solution includes products, check whether they provide automated synchronization from the PROD site to the DR site. If they do not, plan for manual intervention, in which case the RTO and RPO values might be high. Every component in the architecture has to be available in DR. The environment, configuration, installations, image backups, tape backups, file system backups and similar items must be planned ahead by the team at an agreed frequency. Additionally, perform the PROD and DR activities ahead of time to identify accurate RTO and RPO figures, so that the business can be informed accordingly.
3.2) The second check is on the network for PROD-to-DR switching. Run a drill and note the time taken.
3.3) If your solution needs a support team during disaster recovery, identify its members along with their roles and responsibilities.
3.4) Keep a checklist and a plan handy in a document, although this is not really BCM (Business Continuity Management).
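One item on such a checklist, that every component in the architecture is also available in DR, can be checked mechanically. The sketch below assumes you can export a simple component-to-version inventory for each site; the function name `dr_gaps` and the sample inventories are hypothetical.

```python
from typing import Dict


def dr_gaps(prod: Dict[str, str], dr: Dict[str, str]) -> Dict[str, str]:
    """List PROD components that are missing or at a different version in DR."""
    gaps = {}
    for name, version in prod.items():
        if name not in dr:
            gaps[name] = "missing in DR"
        elif dr[name] != version:
            gaps[name] = f"version mismatch: PROD {version}, DR {dr[name]}"
    return gaps


# Hypothetical inventories (component name -> installed version).
prod_inventory = {"app-server": "2.4", "database": "19c", "etl-jobs": "1.7"}
dr_inventory = {"app-server": "2.4", "database": "18c"}

for component, problem in dr_gaps(prod_inventory, dr_inventory).items():
    print(f"{component}: {problem}")
```

Running a check like this at the agreed frequency keeps the checklist honest between drills.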
4) Disaster Recovery solution: Automated vs. Manual
Consider the replication solutions available for any products in your solution. For example, for an Oracle database we might go for Oracle Data Guard replication or a third-party replication tool. If you have an open source database such as PostgreSQL, check which replication solutions are available for it. Where no replication solution exists, manual intervention might be required: we write scripts or carry out the steps manually in a fixed sequence, for example if you have jobs and an ETL solution. A one-time backup and restore will not keep your DR site in sync with production, so you will have to revisit your Disaster Recovery solution.
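For the manual case, the "fixed sequence" idea can be sketched as a small runner that executes the steps in order and stops at the first failure, so that later steps never run against a half-prepared DR site. The step names below are hypothetical placeholders; real steps would call your own scripts and report success or failure.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_in_sequence(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that reports failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, action in steps:
        if not action():
            print(f"Step failed, stopping: {name}")
            break
        completed.append(name)
    return completed


# Hypothetical steps; each lambda stands in for a real script.
steps = [
    ("restore latest database backup on DR", lambda: True),
    ("copy ETL input files to DR", lambda: True),
    ("start scheduled jobs on DR", lambda: True),
]
print(run_in_sequence(steps))
```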
5) RTO (Recovery Time Objective) vs. RPO (Recovery Point Objective)
- Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are two of the most important parameters of a disaster recovery or data protection plan. These are objectives which can guide enterprises to choose an optimal disaster recovery plan.
- Recovery Point Objective (RPO) describes the interval of time that might pass during a disruption before the quantity of data lost during that period exceeds the Disaster Recovery Plan’s maximum allowable threshold or “tolerance.” That is the formal definition; in simple terms, it means how much data loss your business can afford.
- The Recovery Time Objective (RTO) is the targeted duration of time and service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in the business IT solution. In simple terms, it means how soon the business process must be running again, for example from the DR site.
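The two objectives can be compared against what a drill actually achieved. The sketch below assumes three timestamps are recorded during the drill: the last point replicated to DR, the start of the outage, and the moment service was restored. The function names, timestamps and targets are hypothetical.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated: datetime, outage_start: datetime) -> timedelta:
    """Data written after the last replicated point is lost."""
    return outage_start - last_replicated


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Time the business process was unavailable."""
    return service_restored - outage_start


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    return achieved <= objective


# Hypothetical drill timestamps.
last_replicated = datetime(2024, 1, 10, 9, 45)
outage_start = datetime(2024, 1, 10, 10, 0)
service_restored = datetime(2024, 1, 10, 11, 30)

rpo = achieved_rpo(last_replicated, outage_start)    # 15 minutes of data lost
rto = achieved_rto(outage_start, service_restored)   # 1 hour 30 minutes of downtime

# Hypothetical targets agreed with the business: RPO 30 minutes, RTO 1 hour.
print(meets_objective(rpo, timedelta(minutes=30)))  # True
print(meets_objective(rto, timedelta(hours=1)))     # False
```

In this example the data loss is within tolerance but the recovery took too long, which would point towards more automation in the switchover rather than more frequent replication.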