What happens during planned maintenance?

Arpit Bhayani

Published: July 28, 2022

A master failover at GitHub failed, leading to a 5-hour incident. Let's see what happened.

Incident Summary

For five hours, GitHub users saw delays between data being written to the database and that data becoming visible on the interface and API. This happened during a maintenance window in which the team was switching the Master DB.

Planned Maintenance

Planned maintenance is a common way for companies to take a short, scheduled downtime and carry out their maintenance activities in one go. Some activities for which we plan database maintenance are

applying security patches
applying version upgrades
tuning parameters
replacing hardware
rebooting periodically

A popular activity during database maintenance is to switch the Master node, i.e. shift the traffic going to the current master node over to a new node so that we can patch the old instance.

For a very short duration, while the config switch is happening, the database is unavailable, leading to a small outage; this is expected behavior.
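The switch described above can be sketched as a small in-memory model. This is only a minimal sketch, not GitHub's tooling: the Node class, the planned_switchover function, and the use of a single integer as the log coordinate are all hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a database server."""
    name: str
    read_only: bool = True
    log: list = field(default_factory=list)  # ordered list of applied writes

    def write(self, change: str) -> None:
        if self.read_only:
            raise RuntimeError(f"{self.name} is read-only")
        self.log.append(change)


def planned_switchover(old: Node, new: Node) -> int:
    """Move the writer role from `old` to `new` and return the switch coordinate."""
    old.read_only = True        # 1. stop accepting writes (the brief outage starts)
    new.log = list(old.log)     # 2. let the new node catch up (modelled here as a copy)
    coordinate = len(new.log)   # 3. note the log position at the moment of the switch
    new.read_only = False       # 4. promote the new node (the outage ends)
    return coordinate


old_primary = Node("old", read_only=False)
new_primary = Node("new")
old_primary.write("INSERT 1")
old_primary.write("INSERT 2")

switch_coordinate = planned_switchover(old_primary, new_primary)
new_primary.write("INSERT 3")  # writes now land on the new node
print(switch_coordinate, new_primary.log)
```

Between steps 1 and 4 neither node accepts writes, which is the small, expected outage. The coordinate returned in step 3 is what makes it possible to tell later which writes happened after the switch.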

Database Crash

During the failover, once the traffic had moved to the new database, its mysqld process crashed and incoming writes started failing. To mitigate the issue quickly, the team moved the traffic back to the old database. This resolved the immediate problem, and the site was up and running again.

Something interesting happened

Before crashing, the new database served write traffic for 6 seconds. So, after the crash, when the traffic was redirected to the old database, the old database did not have the data that was written in that 6-second window.

This is a huge concern, as it leads to bad UX and, in the worst case, consistency failures. So, how do we remediate this issue?
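To make the divergence concrete, here is a tiny illustration with made-up rows. The dictionaries old_db and new_db and their contents are assumptions for the example; they stand in for the state of each node right after traffic was moved back.

```python
# Hypothetical rows, keyed by primary key, as each node sees them after the rollback.
old_db = {1: "issue opened", 2: "comment added"}
new_db = {1: "issue opened", 2: "comment added", 3: "PR merged", 4: "label added"}

# Writes accepted only by the new database during its short time as master.
missing_on_old = {key: value for key, value in new_db.items() if key not in old_db}
print(missing_on_old)
```

A user who created row 3 or 4 would see it vanish once reads and writes go back to the old database, which is the bad UX mentioned above.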

Remediating master failovers

To remediate this, we take the help of the database's commit log (the binary log, or BINLOG, in MySQL). Whenever we do a failover, we always keep track of the BINLOG coordinates at the moment of the switch.

Once the traffic is back on the old database, all we have to do is iterate through the new database's BINLOG and apply, on the old database, every change that was recorded after the noted coordinates.

This re-creates or modifies, on the old database, exactly the data that was written to the new database, so there is no data loss or consistency breach.
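The replay idea can be sketched in a few lines. This is a simplified model under stated assumptions, not the procedure GitHub ran: BinlogEvent, replay, and the key/value changes are hypothetical, and a single integer position stands in for real MySQL coordinates, which are a binlog file name plus an offset. With MySQL itself, the mysqlbinlog utility and its --start-position option serve this kind of purpose; the sketch below does not call it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BinlogEvent:
    """Hypothetical, simplified log record: a position plus a key/value change."""
    position: int
    key: str
    value: Optional[str]  # None models a delete


def replay(events, start_position, target):
    """Apply every event logged after `start_position` to `target`, in log order."""
    applied = 0
    for event in sorted(events, key=lambda e: e.position):
        if event.position <= start_position:
            continue  # already on the target before the switch
        if event.value is None:
            target.pop(event.key, None)
        else:
            target[event.key] = event.value
        applied += 1
    return applied


noted_position = 100  # coordinate recorded when traffic was switched

# Log of the new database: one event from before the switch, three after it.
new_db_binlog = [
    BinlogEvent(90, "user:1", "alice"),
    BinlogEvent(120, "user:2", "bob"),
    BinlogEvent(150, "user:1", "alice-renamed"),
    BinlogEvent(180, "user:3", None),
]

old_db = {"user:1": "alice", "user:3": "carol"}
applied_count = replay(new_db_binlog, noted_position, old_db)
print(applied_count, old_db)
```

Applying the events in log order matters: the rename of user:1 and the delete of user:3 only produce the right final state if they are replayed in the sequence in which the new database committed them.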

Cleaning up the mess

Typically, after such a failover, it is better to rebuild the read replicas, and hence the GitHub team rotated all of them. Creating a read replica takes time, given the scale of GitHub.

It took them 4 hours to set up the replicas and 1 hour to re-configure the cluster; hence, the incident affected users for over 5 hours.

Here's the video of me explaining this in depth; do check it out.

Companies announce their planned maintenance, but what happens during it? Could something go wrong while running maintenance?

The GitHub team was switching their Master database from one node to another; while doing this, something went wrong and the new database crashed. This led to data divergence and a production incident that lasted over 5 hours.

In this video, we dissect this incident and understand what happens during planned maintenance, what went wrong with GitHub, and how GitHub mitigated it, and we learn some really cool things about switching databases and solving data divergence.

Outline:

00:00 Agenda
02:42 What happened?
03:29 Scaling reads with Read Replicas
04:40 Planned Database Maintenance
10:08 Database crashed and quick mitigation
11:44 Data Divergence between two masters
13:54 Remediating Data Divergence
18:23 Read Replica taking time to spin up

Check out the free course covering all GitHub outages → https://courses.arpitbhayani.me/github-outage-dissections/

You can also

Subscribe to the YT Channel Asli Engineering
Download the notes
Listen to this on the go on Spotify

Thank you so much for reading. If you found this helpful, do spread the word about it on social media; it would mean the world to me.

You can also follow me on your favourite social media: LinkedIn and Twitter.

Yours truly,

Arpit

arpitbhayani.me

Until next time, stay awesome :)

I teach a course on System Design where you'll learn how to intuitively design scalable systems. The course will help you

become a better engineer
ace your technical discussions
get acquainted with a massive spectrum of topics, ranging from Storage Engines and High-throughput systems to the super-clever algorithms behind them

I have compressed my ~10 years of work experience into this course, and aim to accelerate your engineering growth 100x. To date, the course is trusted by 600+ engineers from 10 different countries and here you can find what they say about the course.

Together, we will build some of the most amazing systems and dissect them to understand the intricate details. You can find the week-by-week curriculum and topics, benefits, testimonials, and other information here https://arpitbhayani.me/masterclass.

