8 Disaster Recovery Scenarios to Test
This post was originally published at https://invenioit.com/continuity/disaster-recovery-scenarios-test/
Disaster recovery testing helps to ensure that businesses can effectively recover from an operational disruption. But knowing which disaster recovery scenarios to test can be tricky, especially when some threats seem to be constantly evolving.
Should you only test for scenarios that affect your IT systems? Only your data backup systems?
What about recovery plans for a pandemic? For example, what if you face staffing shortages, supply-chain disruptions or shelter-in-place orders that require your workers to work remotely?
In truth, there are endless disaster recovery scenarios to test if you want to be 100% prepared for every imaginable situation. But not all businesses have the resources or time for such robust testing. So let's look at some of the most crucial scenarios to test for.
Which disaster recovery scenarios should you test?
1) Data loss & backup recovery
This is one of the most important disaster recovery scenarios to test for. When data loss occurs, it's vital that your business is able to quickly restore it from a backup. That's true whether a single file has been deleted or an entire server has failed. If data can't be restored, then the situation could become a costly nightmare.
So, what exactly do you test?
First and foremost, you want to test that your backups are viable and can be restored. Run tests on both file-level restores and full machine recoveries to ensure that both can be completed in a real-world event.
Some things to consider after this testing:
· How long did the recovery take?
· Were the recovery time objective (RTO) and recovery point objective (RPO) met?
· What unexpected issues hindered the recovery process, if any?
· What improvements could be made to speed up the recovery?
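The first two questions can be answered from three timestamps recorded during the test. Here is a minimal Python sketch, offered as an illustration rather than a prescribed method. The function name `evaluate_recovery`, its parameters and the sample targets are all hypothetical. It assumes the achieved recovery time is the gap between the incident and the completed restore, and the achieved recovery point is the gap between the last good backup and the incident.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_backup, incident, restored, rto, rpo):
    """Compare one test recovery against RTO/RPO targets (illustrative only)."""
    downtime = restored - incident        # how long operations were down
    data_loss = incident - last_backup    # work not captured by any backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }


# Hypothetical test run with targets of 4 hours (RTO) and 1 hour (RPO).
result = evaluate_recovery(
    last_backup=datetime(2024, 1, 10, 8, 30),
    incident=datetime(2024, 1, 10, 9, 0),
    restored=datetime(2024, 1, 10, 14, 0),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)
print(result["rto_met"], result["rpo_met"])  # prints: False True
```

In this made-up run, the backup was recent enough to satisfy the RPO, but the five-hour restore missed the four-hour RTO, which is exactly the kind of finding a test should surface.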
All tests should be well documented. If issues arise that call for changes to the recovery process (including technology deployments, protocols or even the testing scenarios themselves), then the disaster recovery plan should be updated accordingly.
2) Failed backups
What happens when a backup can't be restored? This is a common situation for businesses that rely on traditional incremental backups, because of the data corruption that can occur in the backup chain. So, it's another important scenario that businesses should test for.
Testing for a failed backup typically involves two types of response:
· Troubleshooting the problem to see if the failed backup can be restored (time permitting)
· Restoring from another backup
If a secondary backup is available and can be quickly restored, that is usually preferable over spending time trying to "fix" or reconstruct the failed backup.
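That preference can be sketched in Python as a simple fallback loop. This is an illustration under stated assumptions, not any vendor's API: `restore_with_fallback` and the restore callables are hypothetical stand-ins for whatever restore commands your backup system provides.

```python
def restore_with_fallback(sources):
    """Try each backup source in order and stop at the first that restores.

    `sources` is a list of (name, restore_callable) pairs. Each callable is a
    hypothetical stand-in that returns normally on success or raises on failure.
    """
    failures = []
    for name, restore in sources:
        try:
            restore()
            return name, failures
        except Exception as exc:  # note the failure, then try the next backup
            failures.append((name, str(exc)))
    raise RuntimeError(f"all backups failed: {failures}")


def broken_local_backup():
    raise IOError("corrupt increment in backup chain")


def working_cloud_backup():
    return None  # pretend the restore succeeded


used, failed = restore_with_fallback([
    ("local", broken_local_backup),
    ("cloud", working_cloud_backup),
])
print(used, failed)
```

Recording each failure instead of discarding it means the failed backup can still be investigated after operations are restored.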
Restoring from another backup will require its own set of additional testing scenarios.
Example tests:
· Recovery from a cloud backup
· Bare metal restore
· Backup virtualization
· Hypervisor restore
· Export of backup image
· iSCSI restore
Some data backup systems will of course have additional restore options, such as the Rapid Rollback option on the Datto SIRIS (a feature that lets you undo widespread file changes, such as those caused by ransomware). Since each BC/DR solution is unique, you'll want to periodically test every possible recovery method to ensure those options are actually usable in a real disaster.
3) Backup verification testing
Manually testing your backups is always a good idea, but it can also be time-consuming. Many backup systems now feature automated backup verification (or validation) checks that make this process more efficient.
The purpose of backup verification is to verify that a backup can actually be restored. It automates the testing process, checking each new backup for signs of data corruption or any other issues that could impede the recovery process.
While verification testing is designed to be automatic, it still requires oversight. Some things to consider:
· How often does backup verification occur?
· Is it configured properly?
· How is a successful verification (or failure) communicated? Is somebody actively reviewing the test results?
· What types of issues is the verification looking for? Do you have control over these scans?
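One kind of check a verification process might run is a checksum comparison between the original files and a test restore. Real products may do considerably more than this, so treat the following Python sketch as an illustration only. The function `verify_restore` and the directory layout are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """List files that are missing or differ in the restored copy."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        if not dst.is_file():
            problems.append((str(rel), "missing"))
        elif sha256_of(src) != sha256_of(dst):
            problems.append((str(rel), "checksum mismatch"))
    return problems


# Throwaway directories stand in for production data and a test restore.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    Path(src, "a.txt").write_text("payroll")
    Path(src, "b.txt").write_text("invoices")
    Path(dst, "a.txt").write_text("payroll")
    Path(dst, "b.txt").write_text("inv0ices")  # simulated corruption
    report = verify_restore(src, dst)
print(report)  # prints: [('b.txt', 'checksum mismatch')]
```

In practice the comparison would more likely be made against checksums recorded at backup time, since live files keep changing after the backup is taken.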
4) Network interruptions & outages
A prolonged network outage can be just as disruptive as a data-loss event. When the network goes down—or even if a single workstation suddenly can't connect—IT managers must react quickly.
Testing your preparedness for network interruptions is the best way to ensure that you'll be able to rapidly resolve issues when they actually occur. There are a variety of network testing tools that can help to simulate common disaster scenarios.
Example tests include:
· Testing for unexpected surges in network traffic
· Mock tests that replicate the effects of a crippling network attack
· Network health testing that identifies potential problems in specific parts of the network
· Readiness tests that ensure that IT teams are able to rapidly respond
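As a small illustration of network health testing, the Python sketch below probes whether key services accept TCP connections. It is a minimal example, not a substitute for dedicated network testing tools. The names `port_is_reachable` and `health_report` are hypothetical, and the demo targets a temporary listener on the local machine so that it runs without touching a real network.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_report(targets, timeout=2.0):
    """Map each named (host, port) target to its reachability."""
    return {
        name: port_is_reachable(host, port, timeout)
        for name, (host, port) in targets.items()
    }


# Demo: a local listener plays the role of a service that later goes down.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

report_up = health_report({"demo service": ("127.0.0.1", demo_port)})
listener.close()  # simulate the outage
report_down = health_report({"demo service": ("127.0.0.1", demo_port)})
print(report_up, report_down)
```

A scheduled run of a check like this against real hosts would only show that a port answers; it says nothing about whether the team knows how to respond when it does not.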
Remember, these tests should never be limited to just software-based testing. It's critical that network administrators routinely test these disaster recovery scenarios and actually go through the recovery protocols to ensure that they know exactly what to do during a real disruption.
5) Hardware failure
Hardware failure is one of the most common causes of data loss and operational disruptions, but how do you test for it?
Above, we touched on the importance of backup and recovery testing. But that's specific to the data. How quickly will you be able to repair or replace the bad hardware? The answer largely depends on how well your recovery teams have prepared for this scenario.
· What is the process for determining whether hardware can be salvaged or should be replaced?
· If replacement is needed, how fast can the new hardware be deployed?
· How can disaster recovery planning help to speed up the process? For example, are there vendor relationships that can ensure same-day replacement?
All of these questions relate to processes that should be routinely reviewed and tested. Restoring lost data is only the first part of this disaster scenario. A full recovery of the hardware and associated systems is critical for maintaining business continuity, which is why testing all recovery protocols is so essential.
6) Utility outages
Another important disaster recovery scenario to test is a sudden loss of electricity or other utilities. These scenarios are most common during severe weather and other natural disasters, but they can happen for a number of reasons.
Who can forget the NYC blackout in 2019 or the massive 2003 blackout that left large swaths of the Northeast without power?
When blackouts like these occur, or even more routine brownouts, businesses are usually at the mercy of the utility provider to restore power. But that doesn't mean they can't do anything. The costs of a power outage can quickly skyrocket, so every attempt should be made to restore operations through other means.
At the first signs of a utility disruption, recovery teams should get to work quickly:
· Assessing whether the outage is localized to the building or widespread
· Communicating with the utility provider to report the outage and get ETAs for resolution
· Inspecting backup power sources, if deployed, to ensure they're working properly
· Prioritizing critical services and personnel according to the capacity limits of the backup power sources, and/or having teams work remotely if power is available elsewhere
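The prioritization step lends itself to a quick worked example. The Python sketch below is one simple way to decide what stays powered: it takes services in order of criticality and keeps each one that still fits within the backup capacity. The function `plan_power_allocation`, the service names, priorities and wattages are all hypothetical.

```python
def plan_power_allocation(services, capacity_watts):
    """Pick services to keep powered, most critical first, within capacity.

    `services` is a list of (name, priority, watts) tuples, where a lower
    priority number means more critical. All values are hypothetical.
    """
    powered, shed = [], []
    remaining = capacity_watts
    for name, priority, watts in sorted(services, key=lambda s: s[1]):
        if watts <= remaining:
            powered.append(name)
            remaining -= watts
        else:
            shed.append(name)
    return powered, shed, remaining


services = [
    ("file server", 1, 800),
    ("phone system", 2, 300),
    ("office HVAC", 4, 3500),
    ("network switches", 1, 400),
    ("workstations (finance)", 3, 900),
]
powered, shed, spare = plan_power_allocation(services, capacity_watts=2500)
print(powered)  # prints: ['file server', 'network switches', 'phone system', 'workstations (finance)']
print(shed, spare)  # prints: ['office HVAC'] 100
```

Working through a table like this ahead of time means the decision about what to shut off is already made when the generator or UPS kicks in.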
Each one of these protocols should be routinely reviewed and tested to ensure that recovery teams are prepared to act swiftly and know exactly what to do when an outage occurs.
7) On-site threats & physical dangers
There are a number of disaster scenarios that can be extremely harmful to your employees and operations—and yet have little to do with your IT systems. This is why disaster recovery testing (and business continuity testing) should not be strictly limited to IT.
What if the business faces an active-shooter situation? How should employees protect themselves? Where do they go for safety?
Testing for different crisis scenarios can greatly reduce the risk of harm to your most valuable asset: your people. And by protecting your employees, you also protect your operations.
Tests to consider:
· Evacuation drills for fires, active shooters and other on-site dangers
· Emergency procedures for tornadoes, earthquakes and other sudden natural disasters
· Testing the communications systems that will be used to keep employees updated during a prolonged disaster
8) Workforce interruptions
What happens when employees can't make it to work? This could be a situation like COVID-19, in which a viral outbreak forces workers to stay home. Or it could be any of a number of other disaster scenarios:
· Terrorist activity
· Transportation stoppages
· Worker strikes
· Building damage or structural deficiencies
· Prolonged inaccessibility of the building due to a natural disaster or mandatory evacuation
Whatever the scenario, businesses can face a severe operational disruption if workers aren't able to do their jobs. So, having a Plan B is essential.
In response to the coronavirus pandemic, businesses rapidly shifted to remote work, but many were unprepared to do so effectively. Stressed IT systems caused additional roadblocks and increased cybersecurity risks. Many companies also lacked the tools to manage and support their remote workers, which hurt productivity even further.
This is where testing can help deliver far better outcomes. Businesses need to routinely evaluate their preparedness for a sudden workforce interruption and put those protocols to the test. That could involve:
· Testing IT systems & platforms that facilitate remote work
· Testing the procedures that will help to maintain critical operations
· Testing the business's ability to relocate operations
Essentially, any process or system that will be used in response to a workforce interruption should be tested.