System design study: How Flipkart Scaled TiDB to 1M QPS
Problem
According to Flipkart’s blog, most of its microservices relied on MySQL clusters for data storage. This gave them ACID properties and low latency for high-throughput applications. As user traffic grew, the team tried vertically scaling the existing resources, but they needed a long-term solution that was horizontally scalable.
The blog doesn’t state clear reasons for moving away from MySQL, but ensuring ACID properties and low latency across a large number of nodes is a tough challenge in MySQL. Read this blog for why it is hard to horizontally scale SQL databases.
Solution
While looking for an alternative to MySQL, they considered TiDB, which was already in use for some use cases at Flipkart, serving around 60k QPS for reads and 15k QPS for writes.
According to the official website of PingCAP (the company that develops TiDB), TiDB is a good fit for SQL storage for several reasons. One of them is automatic sharding, which provides built-in horizontal scalability across nodes without any manual sharding.
There’s a blog that tests TiDB performance against MySQL on a single server. Its observation is that TiDB is quite fast for OLAP reads (complex SELECT queries that do not benefit from indexes) because it supports parallel query execution and uses multiple CPU cores, whereas MySQL runs each SELECT query on a single core.
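To make "automatic sharding" concrete, here is a toy sketch of range-based sharding with automatic splits. As I understand it, TiDB's storage layer (TiKV) divides data into key-range Regions that are split as they grow, but this is only an illustration of the general idea and not TiDB's implementation. The class name `RangeShardedStore`, the `max_keys_per_shard` limit, and the key format are all hypothetical.

```python
import bisect


class RangeShardedStore:
    """Toy key-value store that splits a shard once it holds too many keys."""

    def __init__(self, max_keys_per_shard=4):
        self.max_keys = max_keys_per_shard
        # starts[i] is the smallest key shard i may hold; "" covers every key.
        self.starts = [""]
        self.shards = [{}]

    def _shard_index(self, key):
        # The owning shard is the last one whose start key is <= key.
        return bisect.bisect_right(self.starts, key) - 1

    def put(self, key, value):
        i = self._shard_index(key)
        self.shards[i][key] = value
        if len(self.shards[i]) > self.max_keys:
            self._split(i)

    def get(self, key):
        return self.shards[self._shard_index(key)].get(key)

    def _split(self, i):
        # Split at the median key so both halves hold roughly equal data.
        keys = sorted(self.shards[i])
        mid = keys[len(keys) // 2]
        left = {k: v for k, v in self.shards[i].items() if k < mid}
        right = {k: v for k, v in self.shards[i].items() if k >= mid}
        self.shards[i] = left
        self.starts.insert(i + 1, mid)
        self.shards.insert(i + 1, right)


store = RangeShardedStore(max_keys_per_shard=4)
for n in range(10):
    store.put(f"user:{n:02d}", {"id": n})
print(len(store.shards), "shards with start keys", store.starts)
```

The caller only ever uses `put` and `get`; the splits happen behind the scenes, which is the property that removes the need for manual sharding. In a real distributed database each shard would also be replicated and moved between nodes, which this sketch leaves out.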
Setup
The main blog has a lot of detail about the resources that were deployed. In a nutshell, the read workload was a mix of 12 SELECT statements used in production. Similarly, the write workload was a single database transaction that executes 10 INSERT statements on 5 different tables. I guess this mix was chosen to measure the true throughput of the database.
Scaling the database
After the initial setup, the teams made a bunch of improvements. The P99 latency for reads stood at 7.4 ms at 1.04 million QPS. For the write workload, the P99 latency at 123K QPS stood at 13.1 ms.
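For readers new to the term, P99 latency is the value that 99% of requests finish within. Here is a minimal sketch of how it can be computed from raw samples using the nearest-rank method. The latency samples are invented for illustration and are not Flipkart's data, and real monitoring systems often estimate percentiles from histograms instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented samples: most requests are fast, a few are slow outliers.
latencies_ms = [2.0] * 985 + [7.4] * 10 + [40.0] * 5

print("P50:", percentile(latencies_ms, 50))
print("P99:", percentile(latencies_ms, 99))
print("max:", max(latencies_ms))
```

The point of reporting P99 rather than the average is that it exposes the slow tail: in this made-up data the median stays low while the slowest 1% of requests are noticeably slower.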
Beyond this throughput, CPU contention (multiple processes competing for the CPU’s resources) caused latency to increase.
But the journey till 1M read QPS also presented some interesting challenges which teams at Flipkart solved by doing the following:
Conclusion
Here are a few concluding remarks:
The top learning for me in this article was reading about the TiDB database and how it can act as an alternative to MySQL storage. Kudos to the Flipkart team and the author Sachin Japate for sharing learnings on this.
That’s it, folks, for this edition of the newsletter. Please consider liking and sharing it with your friends, as it motivates me to bring you good content for free. If you think I am doing a decent job, share a short summary of this article with your network. Connect with me on LinkedIn or Twitter for more technical posts in the future!