System Design Case Study: MySQL High Availability at Flipkart
If you like the free content I put out, consider subscribing to my newsletter on Substack to get a well-researched article every week, delivered straight to your inbox.
Brief Background
Flipkart has thousands of microservices. These microservices were initially deployed in on-premise data centers, i.e., on physical machines owned and operated by the company itself. Flipkart then used the public cloud for some time, and now follows a hybrid approach (both on-premise data centers and the public cloud).
These microservices use MySQL for their data storage.
Why MySQL?
Well, the microservices include logistics, supply chain, order management services, etc. The data these services produce is highly relational: a row in one table references rows in other tables (via foreign keys), so we can JOIN multiple tables and fetch the desired data in a single query. That's why MySQL seemed an appropriate choice here.
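To make the "one query across related tables" point concrete, here is a minimal sketch. SQLite stands in for MySQL, and the `customers`/`orders` tables are hypothetical, not Flipkart's actual schema:

```python
import sqlite3

# Illustrative only: SQLite stands in for MySQL; the schema is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- foreign key
        item TEXT
    );
    INSERT INTO customers VALUES (1, 'Asha');
    INSERT INTO orders VALUES (100, 1, 'headphones');
""")

# One JOIN fetches related rows from both tables in a single query.
row = conn.execute("""
    SELECT c.name, o.item
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
""").fetchone()
print(row)  # ('Asha', 'headphones')
```

With a non-relational store, stitching the customer and the order together would typically take two round trips and application-side code.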
Problem
The problem is one that almost every company runs into when it wants to go from 0 → 1 and ship things fast.
Initially, every team at Flipkart hosted its own MySQL clusters. This is a problem for the following reasons:
Solution
Flipkart realised that it needed to centralize database operations, so that the expertise to handle all the concerns discussed above is built up over time within one team (or group of teams).
Enter Flipkart's homegrown solution: ALTAIR
ALTAIR is Flipkart's managed MySQL-as-a-service offering that takes away the common developer concerns of hosting a database, as well as making changes to it in the future. Let's look at what the folks at Flipkart have developed.
What is Availability and why is it important?
Availability is the probability that your service responds to a request successfully. If you are a software engineer, you know this probability is never quite 100%, thanks to production bugs, runtime issues, etc.
Instead, what we aim for is high availability: we define an SLA (service-level agreement) stating that the service will be available some percentage of the time.
If availability is 99.0 percent, it is called "2 nines"; 99.9 percent is "3 nines," and so on. Every service in a well-run architecture tries to achieve high availability by chasing as many "nines" as possible. FYI: it gets quite difficult to go beyond four "nines," i.e., 99.99% availability.
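The "nines" translate directly into allowed downtime per year. A quick back-of-the-envelope calculation (ignoring leap years):

```python
# Allowed downtime per year for a given number of "nines".
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability_pct: float) -> float:
    """Minutes per year the service may be down at this availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {downtime_minutes(pct):.1f} min/year")
```

Two nines allow roughly 5,256 minutes (about 3.6 days) of downtime a year, while four nines allow only about 53 minutes, which is why each extra nine gets so much harder.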
Flipkart is a big company with many services. With MySQL clusters managed by individual teams, the overall architecture is prone to lower availability, since not every team configures its clusters optimally.
Flipkart's decision to build MySQL as a service, via the ALTAIR system, is an effort to remove developers' database-related concerns and thereby raise the availability of the overall ecosystem.
Introduction: ALTAIR
ALTAIR works on a simple yet powerful MySQL configuration: primary-replica, a.k.a. master-slave.
The primary instance has a critical responsibility: handling write requests. The primary must therefore be highly available. However, instance failures in on-premise data centers are common, which can lead to lower availability and a degraded customer experience. So when the primary instance is failing (or has failed), the monitoring system should trigger the recovery workflow, which promotes a replica instance to be the new primary.
This is called a failover, and after the failover, clients should be able to find the new primary and redirect their writes to it.
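The promotion idea can be sketched in a few lines. This is a hedged illustration of the general technique, not ALTAIR's actual API; `Instance`, `Cluster`, and `failover` are made-up names:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    healthy: bool = True
    role: str = "replica"

@dataclass
class Cluster:
    primary: Instance
    replicas: list

    def failover(self) -> Instance:
        """Promote a healthy replica when the primary is down."""
        if self.primary.healthy:
            return self.primary                    # nothing to do
        candidates = [r for r in self.replicas if r.healthy]
        if not candidates:
            raise RuntimeError("no healthy replica to promote")
        # Real systems pick the most caught-up replica; we take the first.
        new_primary = candidates[0]
        new_primary.role = "primary"
        self.replicas.remove(new_primary)
        self.primary = new_primary
        return new_primary
```

After `failover()` returns, the remaining work is pointing clients at the new primary, which is the service-discovery step described later.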
If you connect the dots, the failover process is directly tied to the high availability of the ecosystem: whenever the primary fails (or is about to fail), a failover must be triggered. This makes the failover process a critical piece of achieving high availability, so we must ensure the failover process CANNOT FAIL.
Before we get into the failover steps, let's look at the ALTAIR architecture.
Architecture of ALTAIR
Key components inside ALTAIR include the MySQL instances (primary and replicas), an Agent, a Monitor, ZooKeeper, and Orchestrator.
Here’s a brief overview of what’s happening inside ALTAIR:
Now that we have understood how the ALTAIR failover process is designed, let’s deep dive into the actual failover process to understand more.
Steps for an on-premise failover process
Summing up what we discussed as part of ALTAIR’s design, Flipkart describes the failover process in four steps:
Step 1: Failure Detection
This step determines that the primary instance has started failing, for reasons like power loss, hardware failure, planned maintenance, a security upgrade, etc.
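A common pattern for this step is to probe the primary repeatedly and declare failure only after several consecutive misses. A minimal sketch, where `probe` stands in for whatever health check the monitor actually runs (e.g. opening a MySQL connection), and the retry counts are illustrative, not ALTAIR's real settings:

```python
import time

def primary_looks_down(probe, attempts: int = 3, delay: float = 0.05) -> bool:
    """Declare the primary down only after consecutive failed probes,
    so a single dropped packet does not kick off recovery."""
    for _ in range(attempts):
        if probe():
            return False          # one successful probe means it's alive
        time.sleep(delay)
    return True
```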
Step 2: Detection of false positives
This step weeds out false positives: if the detected failure turns out to be a false positive, the failover process need not be triggered.
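One standard way to filter false positives (an assumption on my part, not a confirmed ALTAIR detail) is to confirm the failure from several independent vantage points before committing to a failover:

```python
# Sketch: require a quorum of independent observers to agree the primary
# is down. The quorum size and observer model are illustrative.
def confirmed_down(observers, quorum: int = 2) -> bool:
    """observers: callables that return True if that vantage point
    currently sees the primary as unreachable."""
    down_votes = sum(1 for sees_down in observers if sees_down())
    return down_votes >= quorum
```

If only one of three observers reports a failure, say because the network path between that one monitor and the database is flaky, the failover is not triggered.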
Step 3: The failover
This step consists of performing the actual failover process. Here’s a brief overview of what happens behind the scenes:
Step 4: Service Discovery
ALTAIR uses DNS (Domain Name System) for service discovery. Clients resolve a DNS name that maps to the primary instance's IP address. After a successful failover, ALTAIR updates the DNS record to point at the new primary, and client applications start using the new primary without restarting.
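The mechanism is easy to see in miniature. Here a plain dict stands in for the real DNS server, and the hostname and IPs are made up for illustration:

```python
# Toy "DNS": name -> IP. Real clients would query an actual resolver.
dns_records = {"orders-db.primary": "10.0.0.1"}

def connect(hostname: str) -> str:
    ip = dns_records[hostname]        # clients resolve the name each time
    return f"connected to {ip}"

before = connect("orders-db.primary")            # talks to the old primary
dns_records["orders-db.primary"] = "10.0.0.2"    # failover updates the record
after = connect("orders-db.primary")             # new primary, no restart
print(before, "->", after)
```

In practice, DNS TTLs and client-side caching mean the switch is not instantaneous; clients must re-resolve (or have connections reset) before they pick up the new address.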
Note: ALTAIR manages MySQL clusters in the primary-replica configuration, where all replicas copy data from the primary asynchronously. Because of this async replication, zero data loss cannot be guaranteed. Flipkart therefore offers managed TiDB as a solution for services that cannot tolerate any data loss.
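Why async replication can lose data is worth spelling out. The write positions below are made-up values, purely for illustration:

```python
# The primary acknowledges writes to clients before replicas apply them.
primary_committed = ["w1", "w2", "w3", "w4"]   # acknowledged to clients
replica_applied = ["w1", "w2"]                 # replica lags by two writes

# If the primary dies now and this replica is promoted, the writes the
# replica never received are simply gone.
lost_on_failover = primary_committed[len(replica_applied):]
print(lost_on_failover)  # ['w3', 'w4']
```

Semi-synchronous replication narrows this window by waiting for at least one replica to receive each write, at the cost of write latency; fully avoiding it is what consensus-based stores like TiDB are for.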
The split-brain problem
Apart from what we discussed above, there is a very interesting problem that Flipkart's blog talks about, and that I would like to highlight here: the split-brain problem.
Imagine the following situation:
This is a fatal problem: with two primary instances, we are accepting writes into two different databases. Imagine the horror: a Flipkart customer might place an order successfully, yet on the very next refresh not see that order in their account history.
To get an idea of how fatal this problem is: GitHub faced a split-brain issue in 2018 in which writes were accepted in multiple data centers. The network partition lasted only 43 seconds, yet reconciling the data from the split writes across the two data centers took roughly the next 24 hours.
Thus, it is absolutely critical to stop the old primary instance before promoting a replica as the new primary. If ALTAIR cannot confirm that the old primary is down, human intervention is required to resolve the situation and complete the failover. Flipkart is working to remove this human intervention in the future.
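This "stop the old primary first" rule is commonly called fencing, and the decision logic is small enough to sketch. `fence` and `promote` are hypothetical hooks (e.g. killing `mysqld`, or setting the old primary read-only), not ALTAIR's real interface:

```python
def safe_failover(fence, promote):
    """fence() must return True only if the old primary is provably
    unable to accept writes (powered off, killed, or set read-only)."""
    if not fence():
        # Cannot prove the old primary is dead: abort rather than risk
        # two writable primaries. Per the article, this case currently
        # falls back to human intervention.
        raise RuntimeError("fencing failed; manual intervention required")
    return promote()
```

The important design choice is that promotion is unreachable unless fencing succeeds; refusing to fail over is safer than a split brain.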
Kudos to Animesh Agarwal for writing a well-put-together article.
Book an exclusive 1:1 with me here.
That’s it, folks for this edition of the newsletter. Please consider liking and sharing with your friends as it motivates me to bring you good content for free. If you think I am doing a decent job, share this article in a nice summary with your network. Connect with me on Linkedin or Twitter for more technical posts in the future!
Thanks for reading Curious Engineer! Subscribe for free to receive new posts and support my work.