OUTAGES!!
Image source: https://chrissniderdesign.com/blog/resources/social-media-statistics/

Were you one of the WhatsApp/Instagram/Facebook users last week who checked the network, and even restarted the phone, assuming the problem was with the device rather than believing that the world's largest social media platform was having an outage? We have been led to believe that these platforms are immune to outages and will always be available to us because of the state-of-the-art technology and failover methods implemented at their end. But all of this software can at times behave erratically, either because of the lines of code that run the logic or because of the hardware that runs the software, and none of it is immune to failure.

Coming to the issue, let's take a deep dive into whatever information is available about the cause of the outage, which lasted for 6-8 hours. Below is what was shared by Facebook:

configuration changes on the backbone routers that co-ordinate network traffic between our data centres caused issues that interrupted this communication

Can this be understood by everyone who uses social media platforms? We are talking about billions of users (including businesses). For me it's a big "NO". Many users will not be able to digest what was given to them, and keeping users aware of an issue in the simplest way possible is one of the key aspects of a service.

Now, coming to the issue as highlighted by Facebook in the memo (and as we speak, another outage has happened within a week, so something is really not going well), let's break that sentence into simpler terms. It uses the terms configuration changes, backbone routers, network traffic and data centres. Let's see what all of these mean:

  • Configuration Changes: Changes / modifications made to configurable items (software, hardware etc.), such as changing code or changing hardware items like RAM, hard disks or servers, to improve performance, to fix vulnerabilities or defects, or proactively to avoid future impacts / outages. These need to be handled very cautiously, which is why they go through multiple tests and approvals: even a small change may have a large impact on systems (which seems to be the case here, resulting in the outage).
  • Backbone Router: A router is a device that transfers data (text, images, videos etc.) between computer networks. A backbone router is a type of router that links separate systems (in our context, the different machines that let us use a social media platform) in different networks with each other.
  • Network Traffic: Think of it as vehicular traffic; the only difference is that in place of vehicles you have different forms of data travelling across systems, networks, bridges (yes, in the networking world we do have those) etc. Just as an accident can impact the movement of vehicles, a failure of the underlying hardware can impact data transfer between systems and networks. BTW, we do have network highways as well in the networking world.
  • Data Centres: Physical spaces (rooms / buildings) that house many systems and store tons of data to be used across systems and networks. We can think of a data centre as a heart that pumps data in and out between the systems it houses (sometimes across multiple data centres) and our devices (laptops, smart TVs, mobiles, tablets, routers etc.). We have all heard about the cloud; it is nothing but these physical data centres, connected by high-speed networks that move data very quickly.

Now let's go back to the memo shared by Facebook and decode it in simple terms. Facebook engineers were making some changes to the routers responsible for transferring data (text, images, videos etc.) through the network connecting the data centres, the same network that ultimately lets us access our profiles, which are nothing but data stored in those data centres. What could have happened is that one of those changes behaved differently from what was expected, so data was not transferred as it should have been, and finally millions of us could not access our profiles because the data was not available. The data was still there; the only problem was that it could not move across the network. Say an accident happened on the highway and we were not able to move our vehicles: that is what happened, in simple terms.
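For readers who like to see the highway analogy in code, here is a minimal Python sketch. It is purely illustrative: the node names (`user`, `edge`, `backbone`, `dc1`, `dc2`), the links and the stored profiles are all invented, and it does not represent Facebook's real network. It only shows that when the backbone links disappear, the data is still sitting in the data centre but can no longer be reached.

```python
from collections import deque

# Hypothetical topology; every name here is invented for illustration.
links = {
    ("user", "edge"),
    ("edge", "backbone"),
    ("backbone", "dc1"),
    ("backbone", "dc2"),
}

# The data itself lives in the data centres.
profiles = {"dc1": ["alice"], "dc2": ["bob"]}


def reachable(links, start, goal):
    """Breadth-first search: can traffic travel from start to goal?"""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


before = reachable(links, "user", "dc1")

# Assumed failure: a faulty change takes every backbone link out of service.
after_links = {link for link in links if "backbone" not in link}
after = reachable(after_links, "user", "dc1")

print(before, after, profiles["dc1"])  # True False ['alice']
```

The last line is the whole point: the profile data never went anywhere, only the road to it was closed.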

I hope I have not made the issue harder to understand. The next question is why it happened. As you may be aware, a company like Facebook would have made thousands of changes between the time you started reading this article and the time you reached this point, and of course they would have built a lot of automation to detect failures in changes before they hit the systems. Still, with all that complexity, failures are sometimes not detected in time, and events like the one many of us faced happen. The actual root cause of the issue is still not out, and I am not sure whether we will ever see it, but it would be a really good learning to know what actually caused such a massive outage. Was it an incorrect configuration (change) that slipped through their internal reviews (typically done by automated tools), something the automated tools themselves were not able to pick up, or something resulting from human error or ineffective testing that failed to catch the error? It could be anything, but the findings on this failure should certainly be made available to the public, as we are all stakeholders in the social media platform: it is our data that they are cashing in on.
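To make the idea of an automated check on changes a little more concrete, here is a minimal Python sketch of one kind of check such automation might run. It is an assumption for illustration only, not a description of Facebook's tooling: the function names `is_connected` and `safe_to_apply`, the data centre names and the links are all hypothetical. The check simply refuses any change that would leave a data centre cut off from the others.

```python
def is_connected(nodes, links):
    """Return True if every node can reach every other node over the links."""
    nodes = set(nodes)
    if not nodes:
        return True
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == nodes


def safe_to_apply(nodes, current_links, links_to_remove):
    """Hypothetical pre-change check: simulate the change, then test connectivity."""
    proposed = set(current_links) - set(links_to_remove)
    return is_connected(nodes, proposed)


# Invented example network of three data centres, each linked to the other two.
DATA_CENTRES = {"dc1", "dc2", "dc3"}
LINKS = {("dc1", "dc2"), ("dc2", "dc3"), ("dc1", "dc3")}

small_change = safe_to_apply(DATA_CENTRES, LINKS, {("dc1", "dc3")})
risky_change = safe_to_apply(DATA_CENTRES, LINKS, {("dc2", "dc3"), ("dc1", "dc3")})

print(small_change, risky_change)  # True False
```

A check like this only catches the failures it was written to look for, which is exactly why a gap in the checking tool itself is one of the possible causes listed above.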

I rest my thoughts here and would request my readers to share their comments on what they feel could be the actual cause of the outage!!

Partho Biswas

Heading ServiceNow Program & Global Delivery / Senior Client Partner / Customer Success / ServiceNow as a Service / Managed Services / Resource Augmentation

3 years ago

I’m not too sure about the cause since I have not seen a proper RCA on this. Moreover, how do we know whether it’s correct or not! However, good to read, Navin.

Vishnu Varthanan Moorthy

Excellence Evangelist| Delivery & Process Assurance | Senior Director Quality, Capgemini

3 years ago

From the days of NASA spaceship crash examples, we are moving into digital world crash examples for RCA.