GitLab down? What caused it and how to prepare for it
GitLab is an open-source platform based on Git. It includes a full suite of tools and functionality for managing Git repositories, project planning, continuous integration/continuous deployment (CI/CD), code review, issue tracking, and more. GitLab effectively combines the entire software development lifecycle into a single smooth interface, simplifying and streamlining the process.
On July 11, 2024, the GitLab status page (https://status.gitlab.com/) reported that GitLab.com was unavailable. The issue has since been addressed, and GitLab.com is back up and running.
GitLab uses a primary database to store its data. This primary database has copies of itself on multiple servers. These replicas are used to distribute the load and improve performance. The incident was caused by an input/output (I/O) stall that happened with one of the replica databases.
Disk I/O is any operation in which a computer reads from or writes to a disk (such as a hard drive). A stall is a delay in these operations. A disk I/O stall can therefore slow down the whole system, because processes are left waiting for the data they need from the disk. In the July 11 incident, one of the replicas was unable to receive data from the primary database, so it could not update its data to match the primary. As a result, any requests that the load balancers (which distribute network traffic) directed to this replica failed once a timeout expired.
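To make the timeout behavior concrete, here is a minimal Python sketch of a replica picker that stops routing to a replica for a while after it times out. It is an illustration only, not GitLab's actual load balancer: the class name `ReplicaPool`, the replica names, and the 30-second cooldown are all assumptions made for this example.

```python
import time


class ReplicaPool:
    """Hypothetical round-robin picker that skips replicas that recently timed out."""

    def __init__(self, replicas, cooldown_seconds=30.0, clock=time.monotonic):
        self._replicas = list(replicas)
        self._cooldown = cooldown_seconds      # assumed value, not from the incident
        self._clock = clock
        self._unhealthy_until = {}             # replica name -> time it may be retried
        self._next = 0

    def report_timeout(self, replica):
        # Called by a worker when a query to this replica timed out.
        self._unhealthy_until[replica] = self._clock() + self._cooldown

    def is_healthy(self, replica):
        return self._clock() >= self._unhealthy_until.get(replica, float("-inf"))

    def pick(self):
        # Try each replica at most once, in round-robin order.
        for _ in range(len(self._replicas)):
            replica = self._replicas[self._next % len(self._replicas)]
            self._next += 1
            if self.is_healthy(replica):
                return replica
        return None  # no healthy replica; the caller could fall back to the primary
```

The point of the sketch is that a failure is recorded once in shared state, so later requests skip the stalled replica instead of each waiting for their own timeout.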
To handle the growing number of queued queries, the pod autoscaler (which adds or removes worker capacity based on load) created more worker processes. However, each new worker process also had to discover the failing replica for itself. Once it did, it too got stuck waiting for responses to its queries. This caused the problem to escalate, creating a snowball effect.
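The snowball effect can be illustrated with a toy simulation. This is a simplified model built on assumptions, not a reconstruction of GitLab's infrastructure: the function name `simulate_stall`, the round-robin routing, the worker counts, and the naive scaling rule are all hypothetical.

```python
def simulate_stall(ticks, workers=4, replicas=4, timeout_ticks=5, scale_step=2):
    """Return the number of blocked workers at each tick (toy model).

    Replica 0 is treated as stalled: a query sent to it blocks the worker
    until the timeout expires. All parameters are illustrative assumptions.
    """
    blocked_until = [0] * workers   # tick at which each worker is free again
    next_replica = 0                # round-robin counter shared by all workers
    history = []
    for tick in range(ticks):
        for w in range(len(blocked_until)):
            if blocked_until[w] > tick:
                continue            # still waiting on the stalled replica
            replica = next_replica % replicas
            next_replica += 1
            if replica == 0:
                blocked_until[w] = tick + timeout_ticks
        blocked = sum(1 for t in blocked_until if t > tick)
        history.append(blocked)
        if blocked > 0:
            # Naive autoscaler: blocked workers look like load, so add more workers.
            blocked_until.extend([0] * scale_step)
    return history
```

Calling `simulate_stall(10)` returns one count per tick. In this model the number of blocked workers grows as the pool grows, because every added worker can still be routed to the stalled replica.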
This incident demonstrates how a minor issue can disrupt infrastructure and load balancing. For more details, refer to the note on the issue here.
To mitigate the impact of GitLab downtimes, consider the following precautions:
Schedule regular backups of your GitLab repositories, including the database and configuration files.
Have a clear recovery plan in place, detailing steps to restore service and access backups quickly.
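As one way to approach the repository part of the backup step, the following Python sketch mirrors a list of repositories with standard Git commands (`git clone --mirror` for the first run, `git remote update --prune` afterwards). It covers Git data only, not the database or configuration files. The function names and the example URL in the usage note are hypothetical.

```python
import subprocess
from pathlib import Path


def mirror_command(repo_url, backup_root):
    """Build the git command that creates or refreshes a mirror of repo_url."""
    name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    if not name.endswith(".git"):
        name += ".git"
    dest = Path(backup_root) / name
    if dest.exists():
        # A mirror already exists: fetch new refs and drop deleted ones.
        return ["git", "-C", str(dest), "remote", "update", "--prune"]
    # First run: create a bare mirror containing all branches and tags.
    return ["git", "clone", "--mirror", repo_url, str(dest)]


def backup_repositories(repo_urls, backup_root):
    """Mirror every repository in repo_urls into backup_root."""
    Path(backup_root).mkdir(parents=True, exist_ok=True)
    for url in repo_urls:
        # check=True raises CalledProcessError if git fails, so errors are not silent.
        subprocess.run(mirror_command(url, backup_root), check=True)
```

A scheduled job (cron, for example) could call `backup_repositories(["https://gitlab.com/example-group/example-project.git"], "/var/backups/git")`, where both the URL and the path are placeholders. Restoring from a mirror during an outage is then a matter of cloning from the local backup directory.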