Arcturus — Inventory Processing System
1. Introduction.
For a long time, inventory handling was a major pain point for our TIKI system. We could not serve the massive traffic of peak sales events while also preventing sales beyond the limited quantities of flash deals. Every event like Giật Cô Hồn and the legendary sale seasons was a nightmare for us. We had to build a separate website or resort to many tricks to serve our customers, and the results were not encouraging: customers' orders were canceled because the system sold beyond its inventory, or customers lost the deal without understanding why.

All of these problems came from inventory processing. Our system could not guarantee data consistency during peak sales events. We had only two options: consistent data but an overloaded system at peak time, or inconsistent data but a system that stays alive. In other words, a really bad option and a bad option. Of course, we had to choose the second one. We accepted that we would sometimes oversell hot products, cancel those orders, and compensate customers with coupons. But it was painful, stressful, and exhausting: at every peak, our business team had to watch the performance reports and turn off hot SKUs manually.

Meanwhile, our business grew faster and faster, too rapidly for us to keep accepting such a bad customer experience. In June 2018, the challenge was put on our roadmap: we had to rebuild the catalog platform to scale and adapt to our business. But we could only start the project in June 2019, once we understood the problems and the technology thoroughly and had gathered enough of our best engineers. We released at the end of September 2019, right before the 10.10.2019 sale. Since then, our system has proven its reliability and scalability in serving our business growth.
2. Inventory Problems.
Inventory is crucial for an eCommerce system. Every transaction is based on the price and quantity of a product, so the system's reliability depends heavily on accurate inventory. To scale the business, the system has to handle several critical inventory problems. Two of them stand out:
3. Approaches.
Keeping inventory data consistent while scaling is the central challenge in designing the system. To tackle these problems, we took the following approaches:
4. Architecture.
The most challenging part is handling the consistency of the inventory data, which requires strong guarantees: the system must never sell more than the available quantity of each SKU. The straightforward way is a locking mechanism, but locking performs badly and cannot scale to a large volume of traffic, especially at peak time when thousands of customers are racing to place orders. To solve this problem, the architecture has to satisfy two critical requirements:
4.1. Consistency Model.
To deal with the eventual consistency problems, the design is based on the transaction log model. The state of the system is not only the state of the data in the database but also the state of each command that updates the inventory data. The state of the system changes whenever it receives an update command from an external system (checkout, warehouse, order management system…). Each state change is stored in a command queue at a specific offset, and that offset serves as a checkpoint from which the state of the system can be recovered at any time. This is the main idea behind guaranteeing eventual consistency of the data.
The system has one command queue. When an external system wants to update inventory data, it sends a command to the inventory system, which stores it in the command queue. The command queue is used as a transaction log to maintain the state of the system. Two main data sources need to be eventually consistent: the command queue and the persistent database. The offset applied to the persistent database is always less than or equal to the latest offset in the command queue.
The algorithm to maintain eventual consistency is:
By applying this model, the system solves the eventual consistency problem properly and guarantees eventual consistency between all data sources. This mechanism is the backbone for scaling the system later: it opens up many solutions for handling highly concurrent requests while keeping the whole system reliable. People usually say: "There are only two hard things in Computer Science: cache invalidation and naming things." But with this model, the system can reliably replicate data between many data sources: the command queue, local memory, the database, and server nodes.
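As a concrete illustration, here is a minimal sketch of the offset-checkpoint idea in Java. The names (InventoryCommand, InventoryState) are hypothetical, not TIKI's real API: the checkpoint offset loaded from the database decides whether a command replayed from the queue is skipped or applied.

```java
import java.util.HashMap;
import java.util.Map;

// A command from the queue; the queue assigns the monotonically increasing offset.
record InventoryCommand(long offset, long skuId, long quantityDelta) {}

class InventoryState {
    private final Map<Long, Long> quantities = new HashMap<>();
    private long lastAppliedOffset;

    InventoryState(long checkpointOffset) {
        this.lastAppliedOffset = checkpointOffset; // loaded from the persistent database
    }

    /** Apply a command from the command queue exactly once. */
    void apply(InventoryCommand cmd) {
        if (cmd.offset() <= lastAppliedOffset) {
            return; // already reflected in the persisted state: skip during replay
        }
        quantities.merge(cmd.skuId(), cmd.quantityDelta(), Long::sum);
        lastAppliedOffset = cmd.offset();
    }
}
```

On recovery, the system would read the checkpoint offset from the database and replay the command queue from there; the offset check makes replay idempotent.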
4.2. Processing Model.
The system has to handle thousands of transactions per second when customers race to place orders. Many customers can buy one SKU, and one customer can buy many SKUs at the same time. The system has to solve two hard problems at once: achieving high throughput while guaranteeing data consistency. These problems require different optimization strategies, and those strategies usually conflict with each other: maximizing throughput invites race conditions on data updates, while guaranteeing consistency blocks other processing.
The common approach is to optimize the CPU to handle multiple tasks at the same time. But the CPU actually performs poorly when it has to handle multiple tasks at once: it has to switch context between separate tasks, and every context switch slows processing down. It gets worse as the system handles more customers at peak time. On top of that, the system has to guarantee the consistency of inventory data. If many customers race to buy an SKU, many threads will try to update the same inventory data; while the CPU serves one thread, it has to block the others and switch to them after finishing, causing even more context switching and sharply reducing performance and throughput. To maximize the power of the CPU, the system has to minimize context switching. There are two ways to tackle this: reduce the number of concurrent threads, and store the data as close to the CPU as possible.
All inventory data is stored in local memory and loaded from the database only when it is not yet in memory. All data changes are published to another queue and persisted in batches asynchronously. By applying the consistency model above, the system guarantees eventual consistency between the command queue, local memory, and the database. The data flow is one-way: loading data, processing, and writing data are separated into different flows. Based on the offset of the data, the system can manage consistency easily, deciding whether to consume commands to replay the in-memory state or to skip them. The one-way data flow also removes locks at the database and greatly relieves the I/O bottleneck.
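A minimal sketch of this one-way flow, again with hypothetical names (InventoryStore, loadFromDatabase): reads fall back to the database only on a memory miss, and every change is appended to a change queue that a separate flow drains to MySQL in batches.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative in-memory store: reads are served from memory, writes are
// applied in memory and published for asynchronous batch persistence.
class InventoryStore {
    private final Map<Long, Long> memory = new HashMap<>();
    private final Queue<long[]> changeLog = new ConcurrentLinkedQueue<>();

    /** Load from the database only when the SKU is not yet in local memory. */
    long quantityOf(long skuId) {
        return memory.computeIfAbsent(skuId, this::loadFromDatabase);
    }

    /** Apply a change in memory and publish it for asynchronous persistence. */
    void applyChange(long offset, long skuId, long delta) {
        memory.merge(skuId, delta, Long::sum);
        changeLog.add(new long[] {offset, skuId, delta}); // drained in batches elsewhere
    }

    private long loadFromDatabase(long skuId) {
        return 0L; // placeholder for a MySQL read on cache miss
    }
}
```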
When the data is ready and available in local memory, the CPU can access it quickly and easily. The remaining challenge is optimizing the performance of the CPU itself. Instead of dealing with multithreading problems, the system is designed to use a single thread to process transactions. This is the best way to deal with highly concurrent requests: it reduces context switching and maximizes processing throughput. A queue buffers commands so they are processed sequentially. The system uses a special non-blocking queue, a ring buffer, based on the famous open-source LMAX Disruptor, which can handle up to one million transactions per second. There are two rings: one buffering commands to process, and one buffering data changes to persist to the database asynchronously.
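Below is a minimal sketch of this wiring using the LMAX Disruptor API; the event and handler names are illustrative, and the second ring for persistence is elided for brevity.

```java
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class InventoryProcessor {

    /** Mutable event slot; the ring buffer preallocates and reuses these. */
    static class CommandEvent {
        long skuId;
        long quantityDelta;
    }

    public static void main(String[] args) {
        int bufferSize = 1024; // must be a power of two

        Disruptor<CommandEvent> disruptor = new Disruptor<>(
                CommandEvent::new, bufferSize, DaemonThreadFactory.INSTANCE);

        // Exactly one handler thread mutates inventory state, so no locks are
        // needed; the real system would publish the resulting changes to a
        // second ring for asynchronous batch persistence.
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.printf("apply sku=%d delta=%d at seq=%d%n",
                        event.skuId, event.quantityDelta, sequence));
        disruptor.start();

        // Producers publish by claiming a slot and filling it in place.
        RingBuffer<CommandEvent> ring = disruptor.getRingBuffer();
        ring.publishEvent((event, sequence, sku, delta) -> {
            event.skuId = sku;
            event.quantityDelta = delta;
        }, 42L, -1L); // one customer buys one unit of SKU 42
    }
}
```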
4.3. Checkout Integration.
Consistent order and inventory data are crucial: the number of items sold must stay in sync with the quantity of each SKU. When a customer places an order, several systems have to update data, and the two most important pieces are the order and the inventory. These are managed by two separate systems, checkout and inventory, which expose HTTP APIs for integration. Because of the nature of HTTP, the system cannot update the order and inventory data of the two systems in one atomic transaction. That creates a risk of inconsistency between order and inventory data: the inventory may be updated for a new order that then fails to be created, so that quantity of the SKU is lost and cannot be sold to the next customers. To solve this problem, the processing is separated into two phases:
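The two phases themselves are not enumerated in this excerpt. A common pattern for exactly this problem is reserve-then-confirm, sketched below purely as an assumption (all names are hypothetical): phase one holds the quantity before the order exists; phase two confirms the hold once checkout succeeds, or releases it on failure or timeout.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-phase integration: the inventory system never loses
// quantity permanently to an order that was never created.
class ReservationLedger {
    record Hold(long skuId, long qty, Instant expiresAt) {}

    private final Map<String, Hold> holds = new HashMap<>();

    /** Phase 1: tentatively subtract inventory and remember the hold. */
    void reserve(String orderId, long skuId, long qty, Instant expiresAt) {
        holds.put(orderId, new Hold(skuId, qty, expiresAt));
    }

    /** Phase 2a: checkout succeeded, so the subtraction becomes permanent. */
    void confirm(String orderId) {
        holds.remove(orderId);
    }

    /** Phase 2b: checkout failed or the hold expired, so return the quantity. */
    Hold release(String orderId) {
        return holds.remove(orderId);
    }
}
```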
4.4. Warehouse Integration.
The inventory is updated by both consumers and the warehouse system: stock comes in from the warehouse system and is subtracted when customers place orders. Warehouse operations are very complex, with products moving in and out of warehouses very frequently. The data flow from the warehouses is crucial to all operations of TIKI's eCommerce platform, as it maintains the accuracy of each product's inventory. We first approached this by separating the warehouse flow from the consumer flow, which seemed natural: one system streaming data from the warehouse, one system for consumers updating inventory. We spent a few months building this module, named Ursa; its logic is actually far more complex than the consumer logic. After finishing Ursa, we started Arcturus and realized that with two separate systems we had to maintain duplicated logic for product quantities. Worse, it was very hard to keep the data consistent: inventory was updated by two different flows, and the data was stored in two places, in memory and in the persistent layer, MySQL. Rather than trying to patch the drawbacks of that architecture, we returned to the original approach of a single command queue holding all update commands. We treat all sources of change equally, whether from warehouses or from consumers. This way we apply one model consistently, and it is much easier to maintain the logic and the consistency of the data.
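A minimal sketch of that unification, with illustrative types: both flows produce the same kind of command on one queue, so a single consistency model covers everything.

```java
// Every source of change, warehouse or consumer, becomes the same command
// on the single command queue (types here are illustrative only).
sealed interface InventoryUpdate permits WarehouseInbound, CustomerOrder {}

record WarehouseInbound(long skuId, long quantityAdded) implements InventoryUpdate {}
record CustomerOrder(long skuId, long quantityBought) implements InventoryUpdate {}

class CommandQueueWriter {
    /** Both flows call the same entry point; the queue assigns the next offset. */
    void submit(InventoryUpdate update) {
        // append to the single command queue (e.g. publish over a message broker)
    }
}
```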
4.5. Overall Architecture.
The system is built on:
4.6. Benchmark.
Processing transactions in memory only, with no database involved, the inventory processor sustains 350k transactions/s with multiple operators on one key.
Testing with full I/O (input requests from ZeroMQ, output to MySQL), the inventory processor handles 120k transactions/s (no screenshot was captured, so there is nothing to show here).
Throughput
On staging, we can handle up to 5,600 requests per second:
Latency
Cache Hit
Average cache hit rate: 99%
Database load rate: 0.3 requests per second
5. Conclusion.
To build a product, we have to solve many problems systematically. The biggest challenge of this project was combining many approaches and solutions into one plan to solve them all at once. We could not afford to tackle problems reactively, one by one, because the reliability of the system depends on every one of them. Two strategies were especially important to building this project successfully:
6. Contributors.
Arcturus is one of the most important projects at TIKI. It is complex, hard, and highly sensitive. Fortunately, TIKI had the right people at the right time: it was built by some of TIKI's best engineers and received support from the whole company to complete and release it under heavy business pressure. The main contributors are: