What should we do after the Facebook outage?
Inna Kuznetsova
CEO | PE-Backed B2B SaaS Leader | Board Director (Freightos, NASDAQ: CRGO; SeaCube) | Supply Chain | Artificial Intelligence, Machine Learning
Deviating from the recent fashion of talking about the evils of Facebook, I actually want to thank the Facebook team for restoring the systems. Thank you for your stressful night and hard work, and for bringing much-needed social network services, including WhatsApp, back to life for us, the users.
As someone who ran a network business that experienced an outage and worked shoulder to shoulder with our team to get it back to life, I learned a few things.
First, you are never 100% safe, because new unpredictable events happen. New viruses get created and new malicious attacks happen. New software from reputable vendors may have hidden bugs. That is exactly what happened to us - and the bugs only manifested when two particular versions of two products from the same vendor, an OS and a rare file system, were used together. You can constantly improve monitoring and reaction protocols, but you are never fully safe. Anyone who has ever driven a car knows it: even the best ones break down or get stuck in traffic behind an accident on the way to an important meeting.
Second, it takes time to locate the issue and recreate it in a lab. Ever responded to a baby's cry at night? Let me tell you, the feeling is very much the same. You get a call - sometimes in the middle of the night - and the rest of your day is disrupted by tests, calls to vendors and, most of all, uncertainty. There are a lot of diagnostics, logs and experiments to work through before you can locate an issue, confirm it and address it - which also takes time. In our case we had to recreate the whole clustering setup for a system that was mission-critical for the industry. We had office heroes who stayed awake for over 24 hours or slept in the office, and who had to call a driver the next day because getting behind the wheel was dangerous. The vendor had to dispatch a team, then another team, and then a team of people flying in from a lab in a different city. It takes time, and time passes slowly when customers call every minute, trying to explain the impact you are having on their business. Some of them feel they need to be rude to be persuasive - emotions run high.
And that brings us to the next point. Third, customers can forgive technical issues but not a lack of transparency. The first rule for a SaaS business handling an outage is to have a good communications system that keeps those who depend on the service up to speed. Facebook published the status - thank you for that. In our case, being in a B2B environment, we also organized a system of direct calls from the senior executive team to the top customers every few hours to update them on our findings to date, actions taken and projections. We also equipped the client-facing specialists with information they could share with their contacts, and kept the senior executive team and the board up to speed. Explaining deep technical issues in a way that can be understood at all levels of familiarity with IT is hard even without the fear of losing business, jobs, credibility and trust.
I could never thank our team enough when we emerged from the outage - everyone, from IT to customer service, account managers and the executive team, pitched in. And I hope that the Facebook engineering team gets a good night's sleep and a lot of praise today. So, let me be a grateful customer and do what our best customers did: say 'thank you' for the transparency and for restoring the service!
*
I write about bringing IT innovation into business services, especially logistics, shipping and SCM, as well as about building a successful career and women empowerment. To read my future posts please click Follow from here and feel free to join me on Twitter.
VP, Product Management | Scaling Digital & Data-Driven Solutions | Strategic Growth Leader | Columbia Business School MBA
5 years ago: Well said, Inna! These unsung heroes never get mentioned once the service is restored - it is always important to recognize them and say ‘Thank you’ for their efforts...
Strategy, Commercial Build, Operational Excellence, Executive Management
5 years ago: As you look back now, there was so much to learn from that one incident!