Business Continuity and Disaster Recovery Strategy for your IT Landscape
A few years back, Delta Air Lines suffered a critical IT infrastructure outage. Its backup systems were severely delayed in kicking in, which cost the airline over $100 million in lost revenue, not to mention the reputational damage.
In 2021, the Irish Health Service Executive (HSE) was struck by ransomware, bringing the healthcare system to a standstill. The attack forced the HSE to shut down dozens of outpatient services and rendered its payroll system inaccessible, leaving 146,000 healthcare workers without pay for some time. It also took the COVID-19 vaccine portal offline and caused IT outages at five major hospitals, including Children’s Health Ireland.
The scale of the attack was so great that not even the HSE’s cybersecurity protocols could fully prevent it. They needed to shut down more than 85,000 computers and inspect over 2,000 IT systems throughout the HSE system to ensure that the spread of the ransomware could be contained. Full restoration took three months, and it became possible only because the hackers suddenly posted the decryption key online.
Examples of businesses failing due to natural or man-made disasters are plentiful, but the number of organizations fully prepared to handle such threats and risks is still small. According to one recent survey, only 54% of organizations have an established, company-wide disaster recovery plan, and only 57% of surveyed companies have a second on-prem data center dedicated to disaster recovery.
Today, companies are more exposed than ever to threats and risks that may disrupt their business continuity and result in tremendous losses of money and reputation. To be ready to recover from the impact of natural and man-made calamities, organizations should follow the detailed Business Continuity and Disaster Recovery plan below.
Business Continuity and Disaster Recovery Plan
We have designed a four-step BCDR plan that an organization should ideally follow to stay ready for a crisis and overcome it with minimal impact. The plan is holistic: it starts with assessing the risk to and impact on critical assets, moves on to analyzing the parameters for picking the right BCDR strategy and the detailed implementation approach for each strategy, and ends with testing and maintaining the DR plan across different business stages.
The four steps of the BCDR plan are:
1. Risk and Impact Assessment
2. Adopt Best Practices to Mitigate Damage
3. Selecting and Implementing BCDR Strategy
4. Testing and Maintaining DR Plan
1. Risk and Impact Assessment
Risk Assessment
An organization should consider all potential threats, ranging from natural calamities (earthquakes, hurricanes, etc.) to man-made business disruptions (cyber-attacks, system failures, etc.). The next step is to prioritize these threats by combining the likelihood of each threat with the impact it might cause.
Risk = Likelihood of the event * Impact of the event
The vulnerability with the highest risk score should be given the highest priority, and the mitigation plan should follow the order of the risk scores.
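The scoring and prioritization described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the 1-to-5 scales, the `Threat` class and the sample threats are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    name: str
    likelihood: int  # hypothetical scale: 1 (rare) to 5 (almost certain)
    impact: int      # hypothetical scale: 1 (negligible) to 5 (severe)

    @property
    def risk(self) -> int:
        # Risk = Likelihood of the event * Impact of the event
        return self.likelihood * self.impact


def prioritize(threats):
    """Return the threats sorted from highest to lowest risk score."""
    return sorted(threats, key=lambda t: t.risk, reverse=True)


# Illustrative values only.
threats = [
    Threat("Earthquake", likelihood=1, impact=5),
    Threat("Ransomware attack", likelihood=4, impact=5),
    Threat("Storage hardware failure", likelihood=3, impact=3),
]

for t in prioritize(threats):
    print(f"{t.name}: {t.risk}")
```

With these sample values the ransomware attack (score 20) is ranked first, followed by the storage hardware failure (9) and the earthquake (5).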
Impact Assessment
Impact assessment determines which of the company's assets will be affected, and to what extent, by the vulnerabilities and potential threats identified. An impact on servers and IT assets can in turn mean service impact, customer impact, revenue loss and reputational damage. The quantum of impact along each of these dimensions should be documented in detail for every vulnerability. Impact assessment should ideally be done after risk assessment, so that the impact of high-risk vulnerabilities is measured and the right BCDR plan and strategy is selected.
2. Adopt Best Practices to Mitigate Damage
After identifying the critical applications, servers and IT assets whose loss would cause the greatest impact, the organization should design a strategy that mitigates the risk to critical assets first and then mitigates and recovers non-critical assets. While selecting the right BCDR strategy, the organization should adopt the following best practices, which help mitigate the damage during a disaster.
1. Geography - A geography is a discrete market, typically containing one or more regions, that preserves data residency and compliance boundaries. Geographies allow customers with specific data-residency and compliance needs to keep their business-critical data and applications close, on fault-tolerant, high-capacity networking infrastructure.
2. Locally Redundant Storage (LRS) & Geo-Redundant Storage (GRS) - An organization should always have a cloud storage policy that stores multiple copies of its data, so that the data is protected from planned and unplanned events, including transient hardware failures, network or power outages, and massive natural disasters. Redundancy ensures that a storage account meets its availability and durability targets even in the face of failures.
Locally redundant storage (LRS) - Copies data synchronously three times within a single physical location in the primary region. It is the least expensive option available.
Geo-redundant storage (GRS) - Copies data synchronously three times within a single physical location in the primary region, then copies it asynchronously to a single physical location in a secondary region. This protects the data against a region-wide outage.
3. High Availability - The high-availability requirement is a company-level SLA that determines how much data loss an organization can tolerate during downtime (Recovery Point Objective [RPO]) and how much time is permitted between the failure event and restoration (Recovery Time Objective [RTO]). The BCDR plan must be defined according to the high-availability SLA the company has set for each critical service and asset.
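To make the RPO and RTO targets concrete, the sketch below checks a DR setup against them. It is only an illustration: `ServiceSla`, `meets_sla` and the sample numbers are hypothetical, and worst-case data loss is approximated by the interval between replications or backups, which is a simplifying assumption.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ServiceSla:
    name: str
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable time between failure and restoration


def meets_sla(sla, replication_interval, measured_recovery_time):
    """Compare a DR setup with the RPO and RTO targets of a service.

    Assumption: the worst-case data loss equals the interval between
    two replications (or backups).
    """
    return {
        "rpo_met": replication_interval <= sla.rpo,
        "rto_met": measured_recovery_time <= sla.rto,
    }


# Illustrative values only.
payments = ServiceSla("payments", rpo=timedelta(minutes=5), rto=timedelta(hours=1))
result = meets_sla(
    payments,
    replication_interval=timedelta(minutes=15),
    measured_recovery_time=timedelta(minutes=40),
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

Here the service would recover within its RTO, but replicating every 15 minutes cannot satisfy a 5-minute RPO, so the replication approach would need to change.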
3. Selecting and Implementing BCDR Strategy
This phase is one of the most important, since this is where we select and implement the BCDR strategy after determining the criticality of each application and the parameters around it (the second step of the plan).
The baseline from a regulatory compliance perspective must be applied during the design phase. The risk-versus-reward appetite of the business must be clearly agreed upon and documented during this phase. The organization can select from the following BCDR strategies after considering all the factors and assessments from the previous steps.
BCDR Options (Overview)
1. Active – Active
The business-critical application is designed to receive production load in multiple regions. The cloud services in each region need to be configured with production-level capacity. This approach requires a large investment in application design, but it has significant benefits: a low and guaranteed recovery time, continuous testing of all recovery locations, and efficient usage of capacity.
2. Active – Passive
In this approach, we create a secondary hosted service in an alternate region and deploy roles to guarantee minimal capacity. However, the roles don't receive production traffic. This approach is useful for applications that have not been designed to distribute traffic across regions. It requires less investment than Active-Active, since resources are deployed in the DR region with only the minimal capacity needed to run the application in case of a disaster event. The Active-Passive approach can be further divided into the sub-approaches below, based on the company’s business case and requirements.
a. Warm Standby
b. Active - Silent
c. Backup & Restore
Let’s look at each DR strategy in detail.
1. Active – Active
In this case, we deploy all the resources required to run the application in two different regions and use regional load balancers (Traffic Manager, Front Door, etc.) to route the traffic. For data, we need active replication between the two regions so that both regions hold live data.
App Criticality & SLA for Active - Active DR Strategy
Points to Consider
2. Active – Passive
a. Warm Standby
In this case, we deploy the resources required to run the application in two different regions and use regional load balancers (Traffic Manager, Front Door, etc.) to route the traffic. For data, we need active replication between the two regions so that both regions hold live data. The DR site only needs the minimum resources required to run the application in case of a DR event.
App Criticality & SLA for Warm Standby DR Strategy
Points to Consider
b. Active – Silent
In this case, we deploy all the resources required to run the application in the production region; in the DR region, we deploy the resources and then stop them. In case of a DR event, we have to activate the DR environment, manually or automatically, and route traffic to the DR region. There is no active data replication to the DR region.
App Criticality & SLA for Active - Silent DR Strategy
Points to Consider
c. Backup & Restore
In this case, we deploy all the resources required to run the application in the production region and must ensure that we have the necessary backups in case of a DR event. This is a very cost-effective solution for non-business-critical applications.
App Criticality & SLA for Backup & Restore DR Strategy
Points to Consider
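The practical difference between the Active-Active and Active-Passive options described in this step is which regions take production traffic before and after a failure. The sketch below models only that routing decision. The `Region` class, the `route` function and the region names are hypothetical, and a real deployment would rely on the health probes of its load balancer rather than a hand-set flag.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool = True
    serving: bool = True  # False for a passive (standby) region


def route(regions):
    """Return the names of the regions that should receive production traffic."""
    active = [r for r in regions if r.serving and r.healthy]
    if active:
        return [r.name for r in active]
    # No serving region is healthy: fail over to the healthy standby regions.
    standby = [r for r in regions if r.healthy and not r.serving]
    for r in standby:
        r.serving = True  # activate the DR environment
    return [r.name for r in standby]


# Active-Active: both regions serve traffic; losing one leaves the other.
active_active = [Region("region-a"), Region("region-b")]
active_active[0].healthy = False
print(route(active_active))  # ['region-b']

# Active-Passive: the DR region serves traffic only after a failover.
active_passive = [Region("primary"), Region("dr", serving=False)]
print(route(active_passive))  # ['primary']
active_passive[0].healthy = False
print(route(active_passive))  # ['dr']
```

In the Active-Active case no activation step is needed, which is why its recovery time is low; in the Active-Passive case the time to activate the standby region counts towards the RTO.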
4. Testing and Maintaining DR Plan
DR testing is a periodic process that tests the adopted DR strategy to ensure that the system is resilient and capable enough to mitigate the impact of a disaster. IT systems keep changing with upgrades, new deployments and changes in infrastructure. Periodic DR testing is important to ensure that the adopted DR plan stays relevant in the organization's changing IT landscape.
Some important aspects of DR testing are:
1. Frequency of DR tests - The longer the period between two DR tests, the higher the risk that the IT systems will fail the DR plan. It is very important to measure the complete recovery SLAs and methodologies periodically so that the DR plan and approach stay relevant as the business and its IT landscape grow and change.
2. Major infrastructure change - The organization must run a DR test after any major infrastructure change, such as a cloud migration, a change in storage hardware or a hypervisor upgrade. Any such change may require amendments to the DR process.
3. Impact of DR testing - The organization must assess the impact of DR testing on other live environments before conducting it. A DR test may cause database downtime, which can affect many applications. All relevant software upgrades should be scheduled outside the DR test window, and all scenarios should be assessed before the test.
4. Time window of DR testing - To avoid disrupting services and to reduce customer impact, DR testing should be done outside business hours or at a time when traffic on the affected application is minimal.
5. Change management and IT owners' sign-off - Before conducting DR testing, change management and the application owners must be informed and the necessary approvals obtained. Batch jobs should be stopped during the test to avoid false results and failures.
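Two of the aspects above, test frequency and the test window, lend themselves to a simple automated check. The sketch below is a minimal illustration: the function names, the 180-day interval and the 9-to-18 business hours are all assumptions rather than recommendations.

```python
from datetime import date, timedelta


def dr_test_due(last_test, today, max_interval, major_change_since_last_test):
    """Decide whether a DR test should be scheduled.

    A test is due when a major infrastructure change has happened since
    the last test, or when the agreed maximum interval has been exceeded.
    """
    if major_change_since_last_test:
        return True
    return today - last_test > max_interval


def in_test_window(hour, business_start=9, business_end=18):
    """Return True when the hour (0-23) falls outside business hours."""
    return hour < business_start or hour >= business_end


# Illustrative values only.
interval = timedelta(days=180)
print(dr_test_due(date(2024, 1, 10), date(2024, 9, 1), interval, False))  # True
print(dr_test_due(date(2024, 6, 1), date(2024, 9, 1), interval, False))   # False
print(dr_test_due(date(2024, 6, 1), date(2024, 9, 1), interval, True))    # True
print(in_test_window(22))  # True
```

A check like this only flags that a test is due; the impact assessment and the sign-offs described above still have to happen before the test runs.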
DR testing is one of the most important KPIs of the readiness of a business’s IT landscape to face external threats and mitigate their impact. It is also one of the leading indicators top management can use to measure the robustness of the IT infrastructure. The DR test plan is as important as the DR plan itself in ensuring business continuity and high resiliency.
Final Take – BCDR is much more than a plan.
A business continuity and disaster recovery plan and strategy is not limited to making the organization resilient against external threats. The BCDR strategy reflects the organization’s overall responsiveness towards its customers, vendors and investors. An organization’s mindset in a time of crisis leaves a lasting impression on its customers and stakeholders. Hence, an organization is wiser to be ready with the right BCDR strategy and overcome a crisis than to regret it and plan afterwards.