Control an outage by localizing the failures

Arpit Bhayani

Published: August 11, 2022

Outages are inevitable, but we should design our architecture so that a single component going down does not lead to a complete outage.

What happened with GitHub?

GitHub saw a large number of failures in its Actions service, which delayed the processing of queued jobs. The root cause was an infrastructure error in the SQL layer.

Insights about their architecture

Two things about their architecture stand out from this outage.

Synchronous dependency

Although GitHub Actions looks like a single feature, internally it consists of multiple microservices. Some of these have a synchronous dependency on the database. Because of this, when the DB had a hiccup, the entire Actions feature was hindered.
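To make the failure path concrete, here is a minimal sketch of how a synchronous dependency propagates a database hiccup up the call chain. The classes `Database`, `JobQueueService`, and `RunnerService` are hypothetical stand-ins invented for illustration; they are not GitHub's actual services.

```python
class DatabaseUnavailable(Exception):
    """Raised by the hypothetical database client when a query cannot be served."""


class Database:
    def __init__(self):
        self.healthy = True

    def query(self, sql):
        if not self.healthy:
            raise DatabaseUnavailable(sql)
        return f"rows for: {sql}"


class JobQueueService:
    """Hypothetical service that reads the database on every request."""

    def __init__(self, db):
        self.db = db

    def next_job(self):
        # Synchronous dependency: if the DB cannot answer, neither can we.
        return self.db.query("SELECT * FROM jobs LIMIT 1")


class RunnerService:
    """Hypothetical service that calls JobQueueService synchronously."""

    def __init__(self, job_queue):
        self.job_queue = job_queue

    def run_next(self):
        job = self.job_queue.next_job()
        return f"running {job}"


def try_run(runner):
    """Return the result of one run, or "failed" if the DB error bubbled up."""
    try:
        return runner.run_next()
    except DatabaseUnavailable:
        return "failed"


db = Database()
runner = RunnerService(JobQueueService(db))

healthy_result = try_run(runner)   # DB is up, so the whole chain works
db.healthy = False                 # simulate the database hiccup
degraded_result = try_run(runner)  # the runner fails although its own code is fine
```

Nothing in `RunnerService` is broken in the second call, yet it still fails, because every layer waits on the one below it.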

Zero trust communication

The service most affected in this outage handled authentication for service-to-service communication. But why would services need authentication at all? After all, they are all internal to the infrastructure.

Microservices talk. The communication needs to be protected with auth so that any engineer/service gone rogue cannot abuse the system in any capacity. Only authenticated and authorized services are allowed to take action.
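As a rough illustration of "authenticated and authorized", the sketch below signs each request with a per-service secret and checks a permission table before allowing an action. The service names, secrets, and actions are all made up for this example, and HMAC signing is only one of several possible mechanisms; the source does not say what GitHub uses.

```python
import hashlib
import hmac

# Hypothetical per-service secrets; a real system would load these from a secret store.
SERVICE_SECRETS = {
    "runner": b"runner-secret",
    "scheduler": b"scheduler-secret",
}

# Hypothetical authorization table: which service may perform which action.
PERMISSIONS = {
    "runner": {"fetch_job"},
    "scheduler": {"fetch_job", "enqueue_job"},
}


def sign(service, action, secret):
    """Produce a hex signature over the caller's name and the requested action."""
    message = f"{service}:{action}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_allowed(service, action, signature):
    """Authenticate the caller first, then check that it may take the action."""
    secret = SERVICE_SECRETS.get(service)
    if secret is None:
        return False  # unknown caller
    expected = sign(service, action, secret)
    if not hmac.compare_digest(expected, signature):
        return False  # signature does not match: not authenticated
    return action in PERMISSIONS.get(service, set())


# A legitimate call, a forged call, and an authenticated but unauthorized call.
ok = is_allowed("runner", "fetch_job", sign("runner", "fetch_job", b"runner-secret"))
forged = is_allowed("runner", "fetch_job", sign("runner", "fetch_job", b"wrong-secret"))
overreach = is_allowed("runner", "enqueue_job", sign("runner", "enqueue_job", b"runner-secret"))
```

This sketch leaves out replay protection (for example a timestamp or nonce in the signed message), which a production setup would need.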

What about automatic failover?

Given that the outage happened at the database layer, why did the database not auto-recover? Automatic failover is a standard procedure and configuration, and it would have simply promoted a replica to be the new master.

Although it is a common config, during this outage the metrics did not show any issue with the database, and hence the auto-failover was never triggered. It took a long time to even understand the root cause and then start mitigation.

Long-Term Fixes

So, the outage is mitigated, now what?

Update the automation scripts

The automation that reads the telemetry and decides to do a failover needs to be updated so that such failures are detected and action is taken.
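The source does not say which metrics the automation read or how it was changed, so the following is only one plausible shape of such an update. It assumes the old check looked at a liveness signal alone, and that the new check also counts failures of small end-to-end probe queries. The `Telemetry` fields and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Telemetry:
    """Hypothetical snapshot of what the automation sees for the primary database."""
    process_alive: bool   # infrastructure-level liveness signal
    probe_failures: int   # failed end-to-end probe queries in the window
    probe_total: int      # probe queries attempted in the window


def should_failover_liveness_only(t):
    """Assumed old rule: act only when the database is visibly down."""
    return not t.process_alive


def should_failover(t, max_failure_ratio=0.5):
    """Assumed new rule: also act when the database is 'up' but not serving queries."""
    if not t.process_alive:
        return True
    if t.probe_total == 0:
        return False  # no probe data in the window; do not act on missing evidence
    return t.probe_failures / t.probe_total >= max_failure_ratio


# A database that looks alive to infrastructure metrics but fails most real queries.
gray_failure = Telemetry(process_alive=True, probe_failures=9, probe_total=10)
healthy = Telemetry(process_alive=True, probe_failures=0, probe_total=10)

missed_by_old_rule = should_failover_liveness_only(gray_failure)
caught_by_new_rule = should_failover(gray_failure)
false_alarm = should_failover(healthy)
```

The point of the second rule is that it measures what dependent services actually experience, rather than whether the database process merely appears to be running.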

Localizing failures

An important long-term change that needs to be driven is localizing failures. This outage showed how a hiccup in one database or service can cause downtime in all dependent microservices. This shouldn't have happened, as microservices are supposed to solve this very problem.

A good way to keep the blast radius of an outage minimal is to ensure that failures are localized: when a service is down, only that service is affected while everything else keeps functioning.

A common approach to getting this loose coupling is to power inter-service communication through an asynchronous medium, such as a message queue, instead of synchronous API calls. Thus, if something breaks, the messages wait until we fix it, and processing then continues.
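Here is a small sketch of that idea. An in-memory `deque` stands in for a durable message broker, and `Consumer` is a hypothetical downstream service; both are assumptions made to keep the example self-contained.

```python
from collections import deque


class Consumer:
    """Hypothetical downstream service that may be temporarily down."""

    def __init__(self):
        self.up = True
        self.processed = []

    def handle(self, message):
        if not self.up:
            raise RuntimeError("consumer is down")
        self.processed.append(message)


def drain(queue, consumer):
    """Deliver queued messages in order; stop at the first failure and keep the rest."""
    while queue:
        message = queue[0]        # peek at the oldest message without removing it
        try:
            consumer.handle(message)
        except RuntimeError:
            return False          # the message stays queued for a later retry
        queue.popleft()           # remove only after it was handled successfully
    return True


queue = deque()
consumer = Consumer()

# Outage: the consumer is down, but the producer keeps enqueueing without errors.
consumer.up = False
queue.append("job-1")
queue.append("job-2")
drained_during_outage = drain(queue, consumer)
backlog_during_outage = len(queue)

# Recovery: once the consumer is fixed, the backlog is processed in order.
consumer.up = True
drained_after_fix = drain(queue, consumer)
```

The producer never sees the consumer's failure. The cost of the outage becomes delayed processing rather than failed requests, which is exactly the localization the section argues for.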

Here's the video of me explaining this in depth; do check it out.

Thank you so much for reading. If you found this helpful, do spread the word about it on social media; it would mean the world to me.

If you liked this short essay, you might also like my courses on

I teach an interactive course on System Design where you'll learn how to intuitively design scalable systems. The course will help you

become a better engineer
ace your technical discussions
get acquainted with a spectrum of topics ranging from storage engines and high-throughput systems to the super-clever algorithms behind them.

I have compressed my ~10 years of work experience into this course, and aim to accelerate your engineering growth 100x. To date, the course is trusted by 800+ engineers from 11 different countries and here you can find what they say about the course.

Together, we will dissect and build some amazing systems and understand the intricate details. You can find the week-by-week curriculum and topics, testimonials, and other information at https://arpitbhayani.me/masterclass.

More about me: arpitbhayani.me
Newsletter: arpitbhayani.me/newsletter
Subscribe #AsliEngineering for such in-depth engineering concepts: https://www.youtube.com/c/ArpitBhayani
Intermediate-Level System Design course: arpitbhayani.me/masterclass
Beginner-friendly System Design course: https://www.school-of-programming.com
Free course on microservices: https://courses.arpitbhayani.me/designing-microservices
All GitHub Outages: https://courses.arpitbhayani.me/github-outage-dissections/
