GitLab down? What caused it and how to prepare for it

GitLab is an open-source platform based on Git. It includes a full suite of tools and functionality for managing Git repositories, project planning, continuous integration/continuous deployment (CI/CD), code review, issue tracking, and more. GitLab effectively combines the entire software development lifecycle into a single smooth interface, simplifying and streamlining the process.

On July 11, 2024, GitLab's status page (https://status.gitlab.com/) reported that GitLab.com was unavailable. The issue has since been resolved, and GitLab.com is back up and running.

GitLab stores its data in a primary database, which is replicated to multiple servers. These replicas distribute the load and improve performance. The incident was caused by an input/output (I/O) stall on one of the replica databases.

Disk I/O is any operation in which a computer reads from or writes to a disk (such as a hard drive). A stall is a delay in these operations. A disk I/O stall therefore slows the system down, because it is left waiting for the data it needs from the disk. In the July 11 incident, one of the replicas was unable to receive data from the primary database, so it could not update its data to match the primary. As a result, any requests that the load balancers (which distribute network traffic) directed to this replica failed after a certain period of time (a timeout).
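To make the failure mode concrete, here is a minimal Python simulation, not GitLab's actual load-balancing code. It assumes three replicas, simple round-robin routing, and a 2-second timeout; the replica names, response times, and the `route_requests` helper are all hypothetical and chosen only for illustration.

```python
from itertools import cycle

TIMEOUT_S = 2.0  # hypothetical timeout, not GitLab's real setting

# Hypothetical replicas: name -> simulated response time in seconds.
# "replica-2" stands in for the replica stalled on disk I/O.
REPLICAS = {
    "replica-1": 0.05,
    "replica-2": float("inf"),
    "replica-3": 0.04,
}


def route_requests(n_requests, replicas=REPLICAS, timeout_s=TIMEOUT_S):
    """Route requests round-robin and record which ones hit the timeout."""
    results = []
    targets = cycle(replicas)  # cycles over replica names in insertion order
    for request_id in range(n_requests):
        name = next(targets)
        status = "ok" if replicas[name] <= timeout_s else "timeout"
        results.append((request_id, name, status))
    return results


results = route_requests(6)
failed = [r for r in results if r[2] == "timeout"]
for request_id, name, status in results:
    print(f"request {request_id} -> {name}: {status}")
```

In this sketch, every request that lands on the stalled replica times out while the other replicas keep answering normally, which mirrors the behavior described above.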

To handle the increasing number of queries, the pod autoscaler (which manages the load) created more worker processes. However, each new worker process also had to discover the failing replica. Once it did, it too got stuck in the queue, waiting for responses to its queries. This caused the problem to escalate, creating a snowball effect.

This incident demonstrates how a minor issue can disrupt infrastructure and load-balancing. For more details, refer to the note on the issue here.

To mitigate the impact of GitLab downtimes, consider the following precautions:

  • Schedule regular backups of your GitLab repositories, including the database and configuration files.

  • Set up monitoring tools to track GitLab performance and receive alerts for any downtime or unusual activity.

  • Have a clear recovery plan in place, detailing steps to restore service and access backups quickly.

  • Implement fallback mechanisms in your CI/CD pipelines to handle interruptions gracefully.
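As one possible illustration of the last precaution, the Python sketch below retries a pipeline step with exponential backoff and then falls back to an alternative. It is a generic pattern, not a GitLab feature: `run_with_fallback`, `fetch_from_gitlab`, and `fetch_from_mirror` are hypothetical names, and the simulated outage is hard-coded for the demonstration.

```python
import time


def run_with_fallback(primary, fallback, retries=3, base_delay=1.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Try `primary` up to `retries` times, then call `fallback`.

    `primary` and `fallback` are zero-argument callables. The wait between
    attempts doubles each time: base_delay, 2 * base_delay, and so on.
    """
    for attempt in range(retries):
        try:
            return primary()
        except retry_on:
            if attempt < retries - 1:  # no point waiting after the last try
                sleep(base_delay * 2 ** attempt)
    return fallback()


def fetch_from_gitlab():
    # Hypothetical step that simulates GitLab.com being unreachable.
    raise ConnectionError("GitLab.com unavailable")


def fetch_from_mirror():
    # Hypothetical fallback, e.g. a repository mirror or a cached artifact.
    return "fetched from mirror"


delays = []  # record the waits instead of actually sleeping
result = run_with_fallback(fetch_from_gitlab, fetch_from_mirror,
                           sleep=delays.append)
print(result, delays)
```

Passing `sleep=delays.append` keeps the demonstration instant; in a real pipeline you would leave the default `time.sleep` in place and point the fallback at whatever mirror or cache your team maintains.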
