OUTAGES!!

Navin Malhotra

Program Manager - Sustainability CoE @ Capgemini | Sustainability Solutions for Supply Chain Management

Published: October 11, 2021

Were you one of the WhatsApp, Instagram or Facebook users who, last week, checked the network and even restarted the phone to see whether the problem was on your device, rather than believing that the world's largest social media platform was having an outage? We have been made to believe that these platforms are immune to outages and will always remain available to us because of the state-of-the-art technology and failover methods implemented at their end. But software can behave erratically at times, whether because of the lines of code that run the logic or because of the hardware that runs the software, and none of it is immune to failure.

Coming to the issue, let's take a deep dive into whatever information is available about the cause of the outage, which lasted 6-8 hours. Below is what Facebook shared:

configuration changes on the backbone routers that co-ordinate network traffic between our data centres caused issues that interrupted this communication

Can this be understood by everyone who uses the social media platforms? We are talking about billions of users, businesses included. For me it's a big "NO". Many users will not be able to digest what has been given to them, and keeping users aware of an issue in the simplest way possible is one of the key aspects of a service.

Now, coming to the issue as highlighted by Facebook in the memo (and as we speak, another outage has happened within a week, so something is really not going well), let's break that sentence into simpler terms. It uses the terms configuration changes, backbone router, network traffic and data centres. Let's see what each of these means:

Configuration Changes: Changes or modifications made to configurable items (software, hardware, etc.). These could include changing code or replacing hardware such as RAM, hard disks or servers, whether to improve performance, to fix vulnerabilities or defects, or proactively to avoid future impacts or outages. They need to be handled very cautiously, which is why they go through multiple tests and approvals: even a small change may have a large impact on systems (which seems to be the case here, resulting in the outage).
Backbone Router: A router is a device that transfers data (text, images, videos, etc.) between computer networks. A backbone router is a type of router that links separate systems (in our context, the different machines that let us use the social media platform) sitting in different networks with each other.
Network Traffic: Think of it as vehicular traffic; the only difference is that in place of vehicles you have different forms of data travelling across systems, networks, bridges (yes, in the networking world we do have those), etc. Just as an accident can block the movement of vehicles, a failure of the underlying hardware can disrupt data transfer between systems and networks. By the way, we do have network highways in the networking world as well.
Data Centres: Physical spaces (rooms or buildings) that house many systems and store tons of data to be used across systems and networks. We can call them the heart that pumps data in and out between the systems stored in the data centres (sometimes multiple data centres) and our devices (laptops, smart TVs, mobiles, tablets, routers, etc.). We have all heard about the cloud; it is nothing but these physical data centres, connected by high-speed networks that transfer data very quickly.

Now let's go back to the memo shared by Facebook and decode it in simple terms. Facebook engineers were making changes to the routers responsible for transferring data (text, images, videos, etc.) through the network that connects the data centres. That network is what ultimately lets us access our profiles, which are nothing but data stored in those data centres. What could have happened is that one of those router changes behaved differently from what was expected, so data was not transferred as it should have been, and millions of us could not access our profiles because the data was unavailable. The data was still there; the only problem was that it could not move across the network. Imagine an accident on a highway that stops every vehicle from moving. In simple terms, that is what happened.
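For readers who like to see things in code, here is a minimal Python sketch of the highway analogy. It is a toy model and not Facebook's actual network: the node names (`user`, `edge`, `backbone-1`, `backbone-2`, `datacentre`) and the functions `reachable` and `fetch_profile` are all hypothetical, made up purely for illustration. The point it tries to show is that the stored data is never touched; only the path to it disappears.

```python
from collections import deque

# Hypothetical topology: a user reaches the data centre through two backbone routers.
LINKS = {
    ("user", "edge"),
    ("edge", "backbone-1"),
    ("backbone-1", "backbone-2"),
    ("backbone-2", "datacentre"),
}

# The profile data lives in the data centre and is never modified below.
STORED_PROFILES = {"alice": "profile data"}


def reachable(links, start, goal, disabled=frozenset()):
    """Breadth-first search over undirected links, skipping disabled routers."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen and nxt not in disabled:
                seen.add(nxt)
                queue.append(nxt)
    return False


def fetch_profile(name, disabled=frozenset()):
    """Return the profile only if a path to the data centre exists."""
    if not reachable(LINKS, "user", "datacentre", disabled):
        return None  # the data still exists, but it cannot be reached
    return STORED_PROFILES.get(name)
```

With every router working, `fetch_profile("alice")` returns the stored profile. If a change takes `backbone-1` out of service, `fetch_profile("alice", disabled={"backbone-1"})` returns `None`, even though `STORED_PROFILES` is unchanged.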

I hope I have not made the issue harder to understand. The next question is why it happened. As you may be aware, a company like Facebook would have made thousands of changes between the time you started reading this article and the time you reached this point. Of course, they would have built a lot of automation to detect faulty changes before they hit the systems, but amid all that complexity some faults are not detected in time, and events like the one many of us faced happen. The actual root cause of the issue is still not out, and I am not sure whether we will ever see it, but it would be a really good learning to know what actually caused such a massive outage. Was it an incorrect configuration change that slipped through internal reviews (typically done with automated tools), something the automated tools themselves were not able to pick up, or a human error combined with ineffective testing that failed to catch it? It could be anything, but the findings on this failure should certainly be made public, as we are all stakeholders in the social media platform, and it is our data that they are cashing in on.
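To make the idea of automated checks on changes a little more concrete, here is a small Python sketch of one possible pre-deployment check. It is an assumption of what such a check could look like, not a description of Facebook's tooling: the function names `component_of` and `review_change`, and the site names `dc-a`, `dc-b`, `bb-1` and `bb-2`, are hypothetical. The check simply refuses any proposed set of links that would leave a data centre cut off from the others.

```python
def component_of(links, start):
    """Return every node reachable from start over undirected links."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in links:
            # Pick the far end of the link if this link touches the current node.
            other = b if a == node else a if b == node else None
            if other is not None and other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


def review_change(proposed_links, data_centres):
    """List the data centres a change would cut off from the first one.

    An empty list means the change passes this (single, simplistic) check.
    """
    if not data_centres:
        return []
    still_connected = component_of(proposed_links, data_centres[0])
    return [dc for dc in data_centres if dc not in still_connected]


# Hypothetical example: two data centres joined through two backbone routers.
CURRENT_LINKS = {("dc-a", "bb-1"), ("bb-1", "bb-2"), ("bb-2", "dc-b")}
# A proposed change that removes the link between the two backbone routers.
PROPOSED_LINKS = CURRENT_LINKS - {("bb-1", "bb-2")}
```

Here `review_change(CURRENT_LINKS, ["dc-a", "dc-b"])` returns an empty list, while `review_change(PROPOSED_LINKS, ["dc-a", "dc-b"])` returns `["dc-b"]`, so the proposed change would be flagged before it reaches the routers. A real review pipeline would need many more checks than this one, and a bug in the checker itself would let a bad change through.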

I rest my thoughts here and would request my readers to share their comments on what they feel could be the actual cause of the outage!!

Partho Biswas

Heading ServiceNow Program & Global Delivery / Senior Client Partner / Customer Success / ServiceNow as a Service / Managed Services / Resource Augmentation

3 years ago

I’m not too sure about the cause since I have not seen a proper RCA on this. Moreover, how do we know whether it’s correct or not! However, good to read, Navin.

Vishnu Varthanan Moorthy

Excellence Evangelist| Delivery & Process Assurance | Senior Director Quality, Capgemini

3 years ago

From the days of NASA spaceship crash examples, we are moving into digital-world crash examples for RCA.


