OUTAGES!!
Navin Malhotra
Program Manager - Sustainability CoE @ Capgemini | Sustainability Solutions for Supply Chain Management
Were you one of the WhatsApp/Instagram/Facebook users last week who checked the network, or even restarted the phone, assuming the problem was with the device rather than believing that the world's largest social media platform was having an outage? We have been led to believe that these platforms are immune to outages and will always remain available to us because of the state-of-the-art technology and failover methods implemented at their end. But software at times behaves erratically, and neither the lines of code that run the logic nor the hardware that runs the software are immune to failure.
Coming to the issue, let's take a deep dive into whatever information is available about the cause of the outage, which lasted for 6-8 hours. Below is what was shared by Facebook:
"configuration changes on the backbone routers that co-ordinate network traffic between our data centres caused issues that interrupted this communication"
Can this be understood by all the users of social media platforms? We are talking about billions of users, businesses included. For me it's a big "NO". Many users will not be able to digest what has been given to them, and keeping users informed about an issue in the simplest way possible is one of the key aspects of a service.
Now, coming to the issue as highlighted by Facebook in the memo (and as we speak, another outage has happened within a week, so something is really not going well), let's break that sentence into simpler terms. It uses the terms configuration changes, backbone routers, network traffic and data centres. Let's see what all of this means:
Now let's go back to the memo shared by Facebook and decode it in simple terms. Facebook engineers were making changes to the routers responsible for transferring data (text, images, videos etc.) across the network that connects the data centres. That network is what eventually lets us access our profiles, which are nothing but data stored in those data centres. What could have happened is that one of those router changes behaved differently from what was expected, so data was not transferred as it should have been, and millions of us were unable to access our profiles. The data was still there; the only problem was that it could not move across the network. Think of an accident on a highway that stops every vehicle from moving: that is what happened, in simple terms.
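For readers who prefer code to highways, here is a minimal sketch of the same idea. It is only an illustration under my own assumptions: the site names (`user_edge`, `dc_a`, `dc_b`), the sample profile and the helper functions are all hypothetical and have nothing to do with Facebook's real network. It shows that removing one link can make stored data unreachable even though the data itself never changes.

```python
from collections import deque

# Hypothetical backbone: each key is a site, each value the sites it links to.
BACKBONE = {
    "user_edge": {"dc_a"},
    "dc_a": {"user_edge", "dc_b"},
    "dc_b": {"dc_a"},
}

# Hypothetical stored data; it is never modified during the "outage".
STORED_PROFILES = {"dc_b": {"navin": "profile data"}}


def reachable(links, start, goal):
    """Breadth-first search: can traffic get from start to goal?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in links.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def fetch_profile(links, user):
    """Return the profile only if a path to its data centre exists."""
    if not reachable(links, "user_edge", "dc_b"):
        return None  # the data is intact but cannot be reached
    return STORED_PROFILES["dc_b"].get(user)


def without_link(links, a, b):
    """Copy of the backbone with the a<->b link removed (the bad change)."""
    return {n: {m for m in peers if {n, m} != {a, b}} for n, peers in links.items()}
```

Calling `fetch_profile(BACKBONE, "navin")` returns the profile, while the same call on `without_link(BACKBONE, "dc_a", "dc_b")` returns `None`, although `STORED_PROFILES` is untouched.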
I hope I have not made the issue harder to understand. The next question is why it happened. As you may be aware, a company like Facebook would have made thousands of changes between the time you started reading this article and the time you reached this point. Of course, they would have built a lot of automation to detect faulty changes before they hit the systems, but with all that complexity some faults are not detected in time, and events like the one many of us faced do happen.
The actual root cause of the issue is still not out, and I am not sure whether we will ever see one, but knowing what caused such a massive outage would be a good learning. Was it an incorrect configuration change that slipped through internal reviews (typically done with automated tools)? Was it something the automated tools themselves could not pick up? Or was it human error or ineffective testing that failed to catch the fault? It could be anything, but the findings should certainly be made public, as we are all stakeholders in the social media platform: it is our data they are cashing in on.
I rest my thoughts here and would request my readers to share their comments on what they feel could be the actual cause of the outage!!
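To make the idea of an automated check on a change a little more concrete, here is a small sketch. It is purely an assumption on my part about what such a gate could look like, not a description of Facebook's tooling; the functions `all_sites_connected` and `apply_change` and the site names are hypothetical. The gate rejects any proposed configuration that would leave a site cut off from the others.

```python
def all_sites_connected(links):
    """True if every site can be reached from the first site listed."""
    if not links:
        return True
    start = next(iter(links))
    seen, stack = {start}, [start]
    while stack:
        for peer in links.get(stack.pop(), ()):
            if peer not in seen:
                seen.add(peer)
                stack.append(peer)
    return set(links) <= seen


def apply_change(current, proposed):
    """Accept the proposed configuration only if it passes the safety check."""
    if not all_sites_connected(proposed):
        return current, "rejected: change would isolate a site"
    return proposed, "applied"


# Hypothetical configurations used only for illustration.
current = {"dc_a": {"dc_b"}, "dc_b": {"dc_a"}}
bad_change = {"dc_a": set(), "dc_b": set()}
good_change = {"dc_a": {"dc_b", "dc_c"}, "dc_b": {"dc_a"}, "dc_c": {"dc_a"}}
```

Here `apply_change(current, bad_change)` keeps the current configuration, while `apply_change(current, good_change)` accepts the new one. A real check would have to cover far more than connectivity, which is exactly where faults can slip through.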
Heading ServiceNow Program & Global Delivery / Senior Client Partner / Customer Success / ServiceNow as a Service / Managed Services / Resource Augmentation
3 years ago: I'm not too sure about the cause since I have not seen a proper RCA on this. Moreover, how do we know whether it's correct or not! However, good to read, Navin.
Excellence Evangelist| Delivery & Process Assurance | Senior Director Quality, Capgemini
3 years ago: From the days of NASA spaceship crash examples, we are moving to digital-world crash examples for RCA.