When GitHub repository creation failed

Imagine trying to create a repository on GitHub and it simply does not work. This happened in April 2021, when GitHub users were unable to create new repositories.

The root cause of this outage was something seemingly unrelated: Secret Scanning. That is what makes this outage so interesting to dissect.

What is Secret Scanning?

Our API servers need to talk to peripheral components like Databases, Cache, SaaS services, etc. This communication involves some sort of authentication and authorization through auth tokens, passwords, or secret keys.

Developers tend to commit secrets in settings or constants files and push them to GitHub. What if the repository content gets leaked? What if GitHub itself has a data breach and the attacker gets access to the private repositories?

If secrets like AWS access keys, auth tokens, and DB passwords are leaked, an attacker can dump the data and demand a ransom. They may even abuse the infrastructure to perform illegal activities or mine cryptocurrencies.

Hence, GitHub periodically runs a job that checks all the repositories for any secrets that are committed and warns the user about it.
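To make the idea concrete, here is a minimal sketch of what such a scan could look like. It is not GitHub's implementation; the two patterns and the `scan_text` helper are assumptions for illustration, and a real scanner would use many more provider-specific formats.

```python
import re

# Illustrative patterns only (assumptions, not GitHub's actual rule set).
SECRET_PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A hard-coded password assignment such as DB_PASSWORD = "..."
    "generic_password": re.compile(r"(?i)password\s*[:=]\s*['\"][^'\"]+['\"]"),
}


def scan_text(text):
    """Return a (line_number, pattern_name) pair for every suspected secret."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


# A hypothetical settings file that a developer might commit by mistake.
sample_settings = (
    'DEBUG = True\n'
    'AWS_KEY = "AKIAIOSFODNN7EXAMPLE"\n'
    'DB_PASSWORD = "hunter2"\n'
)
findings = scan_text(sample_settings)
```

Each finding points at a line the repository owner should be warned about.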

Repository Creation Flow

When a repository is created, an entry is made in the Secret Scanning table, which is then used by a job that scans for potential secrets and notifies the owner.

What led to the outage?

The GitHub team ran a data migration in which they moved the Secret Scanning table from a common database to its own cluster, allowing it to scale independently.

The GitHub team was unaware that repository creation depended on this table. Hence, once the table was migrated to a different database, the creation of new repositories started failing, leading to this outage. It is interesting to see such mature products having blind spots.
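The hidden dependency can be reproduced in miniature. The sketch below is an assumption-laden simulation, not GitHub's code: the table names, the columns, and the `create_repository` function are hypothetical, and an in-memory SQLite database stands in for the common database.

```python
import sqlite3


def create_repository(db, name):
    """Create a repository and register it for secret scanning in one transaction."""
    with db:  # commits on success, rolls back if any statement raises
        db.execute("INSERT INTO repositories (name) VALUES (?)", (name,))
        # Hidden dependency: repository creation also writes to this table.
        db.execute("INSERT INTO secret_scanning (repo_name) VALUES (?)", (name,))


def repo_count(db):
    return db.execute("SELECT COUNT(*) FROM repositories").fetchone()[0]


common_db = sqlite3.connect(":memory:")
common_db.execute("CREATE TABLE repositories (name TEXT PRIMARY KEY)")
common_db.execute("CREATE TABLE secret_scanning (repo_name TEXT PRIMARY KEY)")

create_repository(common_db, "before-migration")  # works

# Simulated migration: the table now lives elsewhere, so it is gone from here.
common_db.execute("DROP TABLE secret_scanning")

try:
    create_repository(common_db, "after-migration")
    outcome = "created"
except sqlite3.OperationalError as error:
    outcome = f"failed: {error}"
```

After the simulated migration, the second call fails and the transaction rolls back, so no new repository row is written. A code path that looks unrelated to the migrated table is the one that breaks.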

How did GitHub mitigate it?

GitHub mitigated the outage by rolling back the migration. The incident report does not say exactly what they did, but here are a few speculations:

  1. they could have quickly recopied the table to the old database
  2. they could have whitelisted the database so that applications could connect
  3. the old table could have been left intact, so they would have just renamed it and made it active again.

Again, this is pure speculation, given that we do not have any insider information and the report does not specify the details. It would have been fun to go through their actual mitigation steps; we could have learned so much. Nonetheless, we did learn a few interesting insights from this outage.

Here's the video of me explaining this in depth; do check it out.

Imagine trying to create a new GitHub repository and the call keeps failing for 53 minutes. This happened at GitHub in April 2021, when for 53 minutes people were unable to create any new repositories. Upon investigation, they found that the root cause was Secret Scanning. Two seemingly unrelated use cases took down one of the most important APIs.

This has to be one of the most amusing outages that I have seen in recent times. In this video, we dissect this outage, understand the root cause of it, look at the importance of secret scanning, and conclude with an understanding of their mitigation process.

Outline:

  • 00:00 Agenda
  • 02:49 What happened?
  • 03:20 Secret scanning and its importance
  • 08:28 Repository Creation Flow
  • 09:15 What led to the outage?
  • 17:26 Outage mitigation

Thank you so much for reading. If you found this helpful, do spread the word about it on social media; it would mean the world to me.

You can also follow me on your favourite social media: LinkedIn and Twitter.

Yours truly,

Arpit

arpitbhayani.me

Until next time, stay awesome :)

I teach a course on System Design where you'll learn how to intuitively design scalable systems. The course will help you

  • become a better engineer
  • ace your technical discussions
  • get acquainted with a massive spectrum of topics, ranging from Storage Engines and High-throughput systems to the super-clever algorithms behind them.

I have compressed my ~10 years of work experience into this course, and aim to accelerate your engineering growth 100x. To date, the course is trusted by 600+ engineers from 10 different countries and here you can find what they say about the course.

Together, we will build some of the most amazing systems and dissect them to understand the intricate details. You can find the week-by-week curriculum and topics, benefits, testimonials, and other information here https://arpitbhayani.me/masterclass.

More about me: arpitbhayani.me
Newsletter: arpitbhayani.me/newsletter
Subscribe to #AsliEngineering for such in-depth engineering concepts: https://www.youtube.com/c/ArpitBhayani
Check out my System Design course: arpitbhayani.me/masterclass
