"No probs, we have the backups" - is it enough when <youknowwhat> happens?
This is an old story, but I have used it many times when someone asks: "Do you have a contingency plan? Do you have backups?"
We ran a mission-critical, almost real-time mainframe software system. There was a big construction site next door, and the work included some digging. Then, as we were told afterwards, there was a huge flash of light (we saw nothing in the computer room) and, oops, all power went off.
Then, since it had been well planned that something like this might happen, our large UPS system kicked in with a big "QUACK" sound. OK then, what next?
At the operational level we had a protocol: "freeze" the system and relaunch it only when it was completely safe to do so. However, that also meant that all activities stopped, the software took some time to restart, and customers were not served during the break. The warm restart took 15-30 minutes on average.
So the Director responsible for authorising system operators to perform the "system freeze" command was hesitating. Oh, we have a UPS. Good for about 30 minutes (back in 1994...). Most blackouts last no more than 3-5 minutes! Let's wait a little and see if the power comes back!
To cut a long story short, the decision to freeze the system was made only when it was too late. Power had not returned, the UPS batteries were depleted (and it took several hours to charge them again), and the system went down uncontrolled, without anyone doing the right thing and saving all the transactional data.
The machine, though, a huge mainframe, and the software on it were brilliant. It still managed to save MOST of the data (it had a built-in mechanism for such cases, as if someone on the developer team had thought about indecisive Directors and unforeseen, sudden events...), which meant that we lost only about 40-80 transactions. The total number of transactions that day was about 8,000, so it could have been worse. The lost data belonged to individual betting tickets, some of them "winners". They had to be paid at face value, which meant manual administration.
It also took about 15 hours of recovery work. After the initial restart, it took several hours and a lot of effort to find the data the machine had saved in the emergency, and then to build a database with a macro enabling fast data search and payment. That was actually when I learnt to write spreadsheet macros to speed up the process.
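Today I would not reach for a spreadsheet macro; a few lines of any scripting language do the same job. Below is a minimal sketch of the idea in Python, with made-up field names and a made-up file format (the real recovery file layout is long forgotten): load the recovered transaction records, index them by ticket number, and look up any ticket presented for payment.

```python
import csv

# Hypothetical layout of the recovered transaction dump:
# ticket_id, timestamp, stake, status
def load_recovered(path):
    """Read the emergency dump into a dict keyed by ticket id."""
    recovered = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            recovered[row["ticket_id"]] = row
    return recovered

def lookup(recovered, ticket_id):
    """Return the record for a ticket presented at the counter, or None."""
    return recovered.get(ticket_id)

if __name__ == "__main__":
    records = load_recovered("recovered_transactions.csv")  # assumed file name
    hit = lookup(records, "A12345")  # assumed ticket id format
    if hit:
        print(f"Ticket A12345 found, stake {hit['stake']}: pay at face value")
    else:
        print("Ticket not in recovered data: manual check needed")
```

The point is not the tool but the turnaround time: once the recovered records are indexed, each ticket at the counter is a sub-second lookup instead of a manual search through printouts.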
And, remember, following the proper protocol would have meant 30 minutes of downtime, instead of the 15+ hours and extra hassle it finally cost us because we did not follow it.
Afterwards we wrote a very detailed contingency plan, covering not only the technical processes but also authorisations and "automatic" responses, to avoid these situations and risks.
So why did I tell you this old story? Because of the lessons learned from it.
1. About the importance of having a Plan B, a Contingency Plan (which goes beyond the "oh, we have the UPS" shortsightedness) and a whole table of authorisations (who does what in which situation), plus a Plan C, knowing full well that the UPS we had at the time (early 90s) could run for 30 minutes and then needed 6 hours to be ready for its next job (see the sketch after this list).
2. About making the right decision at the right time. There is usually a "time window" to minimise risk; if you miss it, whatever decision you make will not be the best option, if you have any options left at all.
3. About the responsibility of management. If you need to make a decision, you have to take responsibility, especially in an emergency, even if you are called over-cautious or a risk-taker. Making decisions is the primary reason WHY you sit in that fancy office, behind your fancy desk, not just to have a face to represent the company.
4. About having the right information and knowing what it means. If you don't, you base your decision on false facts (but beware: by the time you have 100% of the information, you may have already missed the boat, another factor which is an important part of the art of management...).
5. Mistakes were made, and they can be survived, but as I read in the famous biography of Lee Iacocca: "please learn enough from this case and avoid, by all means, making the same mistake again".
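To make lessons 1 and 2 a little more concrete, here is a small, purely illustrative sketch (the role names, timings and safety margin are my assumptions, not the actual plan we wrote): it encodes an authorisation table as data and computes the latest moment at which the "system freeze" must be ordered, so that nobody has to wait for a hesitating Director.

```python
from dataclasses import dataclass

@dataclass
class ContingencyRule:
    event: str            # what has happened
    authorised_role: str  # who may act without further approval
    action: str           # the pre-agreed response
    deadline_min: int     # minutes after the event by which the action must start

# Illustrative numbers only: UPS runtime, freeze duration and margin are assumptions.
UPS_RUNTIME_MIN = 30      # how long the batteries last
FREEZE_DURATION_MIN = 5   # how long a controlled "system freeze" takes
SAFETY_MARGIN_MIN = 5     # buffer for surprises

# The decision window: the freeze must START early enough to finish before the
# batteries run out, with a margin, so the latest safe decision point is well
# before minute 30.
latest_decision = UPS_RUNTIME_MIN - FREEZE_DURATION_MIN - SAFETY_MARGIN_MIN

RULES = [
    ContingencyRule(
        event="mains power lost, running on UPS",
        authorised_role="duty shift operator",  # no Director's approval needed
        action="issue system freeze",
        deadline_min=latest_decision,           # 20 minutes in this example
    ),
]

for rule in RULES:
    print(f"{rule.event}: {rule.authorised_role} must '{rule.action}' "
          f"within {rule.deadline_min} minutes")
```

The design choice is the point: the window is calculated in advance and written into the plan, and the authority to act inside that window is delegated ahead of time, so the decision is not improvised while the batteries drain.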
We are now experiencing a situation where you may have to cancel almost all elements of your Plan A. Plan B may not work for longer than a week. A month later you might be working according to Plan X, but that is still much better than having had no plan at all at the beginning, at least a basic one for emergencies. And yes, we need to test whatever options and solutions are available, in a very quick and very agile mode. Good luck (and good problem-solving skills) to you all. Stay safe.