CTW's near real-time data pipeline solution
*Our data wizard York Yu explains a solution he and the team worked on earlier in the year.*
The near real-time data pipeline solution
The existing data pipeline
As a game platform, g123 consumes many different kinds of event data from our games.
Previously, we used Firehose and Lambda to transform the data and dump the events into S3. Because duplicate events can be produced on the client side, we also needed to deduplicate the data and add a formatting layer on top of the raw data. The workflow looked like this:
In this solution, we relied on AWS-managed services. The only code we had to maintain was in the Lambda functions, which let us deliver the pipeline quickly.
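As a rough illustration (a hypothetical handler, not our production code), a Firehose transformation Lambda in Java receives base64-encoded records, transforms each one, and hands it back with a result status:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Firehose transformation Lambda: decode each record,
// apply a simple normalization step, and return it to Firehose.
public class EventTransformHandler
        implements RequestHandler<Map<String, Object>, Map<String, Object>> {

    @Override
    @SuppressWarnings("unchecked")
    public Map<String, Object> handleRequest(Map<String, Object> event, Context context) {
        List<Map<String, Object>> records = (List<Map<String, Object>>) event.get("records");
        List<Map<String, Object>> transformed = new ArrayList<>();

        for (Map<String, Object> record : records) {
            String raw = new String(
                    Base64.getDecoder().decode((String) record.get("data")),
                    StandardCharsets.UTF_8);

            // Placeholder for the real transformation logic.
            String normalized = raw.trim() + "\n";

            Map<String, Object> out = new HashMap<>();
            out.put("recordId", record.get("recordId"));
            out.put("result", "Ok"); // or "Dropped" / "ProcessingFailed"
            out.put("data", Base64.getEncoder()
                    .encodeToString(normalized.getBytes(StandardCharsets.UTF_8)));
            transformed.add(out);
        }

        Map<String, Object> response = new HashMap<>();
        response.put("records", transformed);
        return response;
    }
}
```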
Although this solution worked well enough in 2021, it left us with several problems:
To solve these problems, we made a small change to the existing solution: we replaced Firehose and the Lambda function with Flink.
New data pipeline built with Flink
For stability and manageability, we considered the following kind of system:
The Flink job graph looks like this:
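In rough code terms (placeholder stream names and paths, not our actual job), this version read events from Kinesis, applied the transformation that used to live in the Lambda function, and wrote plain files to S3:

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

import java.util.Properties;

// Sketch of the first Flink version: Kinesis -> transform -> raw files on S3.
public class RawEventPipeline {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // FileSink needs checkpoints to commit finished files

        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "ap-northeast-1");
        consumerConfig.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

        // "game-events" is a placeholder stream name.
        DataStream<String> events = env.addSource(
                new FlinkKinesisConsumer<>("game-events", new SimpleStringSchema(), consumerConfig));

        // The transformation that used to live in the Lambda function.
        DataStream<String> transformed = events.map(value -> value.trim());

        // Write newline-delimited events to a placeholder S3 prefix.
        FileSink<String> sink = FileSink
                .forRowFormat(new Path("s3://example-bucket/raw-events"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .build();
        transformed.sinkTo(sink);

        env.execute("raw-event-pipeline");
    }
}
```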
There were two problems here that cost us time:
After finishing this version, we realized that although we had migrated the data transformation into Flink, we still had to perform data deduplication and formatting in a separate process. We had to find a way to simplify the whole pipeline design by doing everything in Flink.
There were two possible ways to reach that goal:
We compared the two solutions and chose Hudi for these reasons:
New data pipeline built with Hudi on Flink
This is what the system looks like.
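Conceptually, the whole thing fits in one Flink job. The sketch below (placeholder names and schema; the connector options depend on the Flink and Hudi versions in use) reads events from Kinesis and upserts them into a Hudi table keyed by event id, so duplicate events collapse into a single record instead of needing a separate deduplication step:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

// Sketch of the Hudi-on-Flink pipeline expressed in Flink SQL.
// Table names, schema, and connector options are illustrative only.
public class HudiEventPipeline {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // Hudi commits data on Flink checkpoints
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Source: the raw game-event stream on Kinesis (placeholder names).
        tEnv.executeSql(
            "CREATE TABLE kinesis_events (" +
            "  event_id STRING," +
            "  event_type STRING," +
            "  event_time TIMESTAMP(3)," +
            "  payload STRING" +
            ") WITH (" +
            "  'connector' = 'kinesis'," +
            "  'stream' = 'game-events'," +
            "  'aws.region' = 'ap-northeast-1'," +
            "  'scan.stream.initpos' = 'LATEST'," +
            "  'format' = 'json'" +
            ")");

        // Sink: a Hudi table on S3. The primary key makes repeated events upsert
        // into the same record, which replaces the separate deduplication step.
        tEnv.executeSql(
            "CREATE TABLE hudi_events (" +
            "  event_id STRING PRIMARY KEY NOT ENFORCED," +
            "  event_type STRING," +
            "  event_time TIMESTAMP(3)," +
            "  payload STRING," +
            "  dt STRING" +
            ") PARTITIONED BY (dt) WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 's3://example-bucket/hudi/events'," +
            "  'table.type' = 'COPY_ON_WRITE'" +
            ")");

        // One continuous job: transform + dedup (via upsert) + write to S3.
        tEnv.executeSql(
            "INSERT INTO hudi_events " +
            "SELECT event_id, event_type, event_time, payload, " +
            "       DATE_FORMAT(event_time, 'yyyy-MM-dd') AS dt " +
            "FROM kinesis_events");
    }
}
```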
Migrating the previous version of the Flink application into this one wasn’t easy.
The main points we had to pay attention to:
There are two table types in Hudi: merge-on-read (MOR) and copy-on-write (COW).
A MOR table handles writes like this:
- Calculate the index based on the incoming data and decide whether to insert or update
- Write the incoming data to row-based Avro log files first
- Create a compaction plan
- Compact the data in the Avro files into Parquet files by creating a new file version
A COW table handles writes like this:
- Calculate the index based on the incoming data and decide whether to insert or update
- Update the existing Parquet files by creating a new file version
We compared the two table types based on the data we had, starting with the COW table.
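A simplified sketch of the COW config (placeholder schema and values rather than the exact settings we used, reusing the tEnv table environment from the earlier example):

```java
// Illustrative COPY_ON_WRITE table: updates rewrite the Parquet files
// by creating a new file version on every commit.
tEnv.executeSql(
    "CREATE TABLE events_cow (" +
    "  event_id STRING PRIMARY KEY NOT ENFORCED," +
    "  payload STRING," +
    "  dt STRING" +
    ") PARTITIONED BY (dt) WITH (" +
    "  'connector' = 'hudi'," +
    "  'path' = 's3://example-bucket/hudi/events_cow'," +
    "  'table.type' = 'COPY_ON_WRITE'," +
    "  'write.tasks' = '4'" +     // write-operator parallelism (assumed value)
    ")");
```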
The Flink job graph looks like this:
Then we tested the MOR table config.
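Sketched the same way, the MOR variant mainly changes the table type and adds compaction settings (option names follow the Hudi Flink docs and may differ between releases):

```java
// Illustrative MERGE_ON_READ table: writes land in row-based Avro log files
// first and are later compacted into Parquet.
tEnv.executeSql(
    "CREATE TABLE events_mor (" +
    "  event_id STRING PRIMARY KEY NOT ENFORCED," +
    "  payload STRING," +
    "  dt STRING" +
    ") PARTITIONED BY (dt) WITH (" +
    "  'connector' = 'hudi'," +
    "  'path' = 's3://example-bucket/hudi/events_mor'," +
    "  'table.type' = 'MERGE_ON_READ'," +
    "  'compaction.async.enabled' = 'true'," +   // compact in the background
    "  'compaction.delta_commits' = '5'" +       // compact every N delta commits (assumed value)
    ")");
```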
The Flink job graph looks like this:
After testing both configs, we got the rough results below:
We expected MOR to give us better performance thanks to asynchronous compaction, but the result went against our assumption. Checking the Hudi source code, we found the root cause: when Hudi creates the compaction plan, it loads all partition paths and then filters them using the compaction settings. The more partitions you have, the more time this listing takes, regardless of whether compaction runs asynchronously or synchronously.
According to the Hudi docs, you can enable a metadata table that maintains the list of files in storage, avoiding the cost of listing all the partitions. In our case, though, we chose the COW table option because its total time cost shows up clearly in each checkpoint.
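For reference, on the MOR sketch above that would be one extra table option; the 'metadata.enabled' name is taken from the Hudi docs and may vary by version:

```java
// Assumed Hudi Flink option: keep a metadata table with the file listing so
// compaction planning does not have to list every partition path on S3.
tEnv.executeSql("ALTER TABLE events_mor SET ('metadata.enabled' = 'true')");
```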
Although Hudi on Flink is really fast at delivering data into S3, there are still some tricks to make it even faster under high load:
Performance details
We were using:
According to the weekly graph, the pipeline can handle 100–250 TPS with stable latency.
When we tested it by consuming the remaining backlog of data, we saw that the throughput was close to the Kinesis read capacity (4 × 1,000 records per second).
When we looked deeper into the checkpoint duration details, we saw that the processing time increased along with the size of the file currently being written, but once a new file started to be written, the processing time dropped rapidly.
Summary
Using Hudi on Flink, we were able to consolidate previously separate data processing logic into a single job. According to the performance testing results, the application met our speed requirements. Unlike Hudi on Spark, Hudi on Flink keeps the record index in Flink state storage, which saves a lot of the time otherwise spent loading the index, but this is also a double-edged sword depending on your business needs. In short, the sensible use of Hudi on Flink solved our technical problems and met our business needs in one shot.
_________
For more information about our available positions, have a look below.
We’re actively looking for talent in the following roles.