CTW's near real-time data pipeline solution

*Our data wizard York Yu explains a solution he and the team worked on earlier in the year.*

The existing data pipeline

As a game platform, g123 ingests many different kinds of event data from our games.

Previously, we used Kinesis Data Firehose and Lambda to transform the events and dump them into S3. Since duplicate events could be produced on the client side, we also needed a deduplication step and a formatting layer for the raw data. The workflow looked like this:

[Image: the original Firehose + Lambda pipeline]

In this solution, we relied on AWS-managed services. The only code we had to maintain was in the Lambda functions, which let us deliver the pipeline quickly.
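A Firehose transformation Lambda of this kind follows AWS's record-batch contract: decode each base64 record, transform it, and return it marked `Ok` or `ProcessingFailed`. A minimal sketch, with hypothetical payload field names:

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: decode, reshape, and re-encode each record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical formatting step: normalize the event envelope.
            formatted = {
                "event_type": payload["type"],
                "event_time": payload["ts"],
                "payload": payload.get("body", {}),
            }
            data = base64.b64encode(json.dumps(formatted).encode()).decode()
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (KeyError, ValueError):
            # Malformed events are marked as failed so Firehose can park them.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
    return {"records": output}
```

Note that the transformation itself stays synchronous inside the Lambda; it is the pipeline around it that delivers results asynchronously.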

Although this solution worked well enough in 2021, it left us with two problems:

  1. Data transformation was asynchronous, so we couldn't guarantee that a given event had finished processing by the time we queried the data, unless we added yet another Kinesis stream after the Lambda function
  2. Kinesis Data Firehose can partition data dynamically, but whenever we needed more complex partitioning logic we had to create more Lambda functions, eating into resource quotas needed by other services

To solve these problems, we made a small change to the existing solution: we replaced Firehose and the Lambda functions with Flink.

New data pipeline built with Flink

[Image: the Flink-based pipeline]

For stability and manageability, we designed the system as follows:

  • Run the Flink cluster on Kubernetes (EKS) in application mode
  • Enable HA and external state storage for fault tolerance
  • Use the Datadog metrics reporter to expose job metrics
  • Transform each record before storing it in S3
  • On a processing failure, redirect the event to an error output folder using Flink's side-output feature
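The last two bullets (transform, then side-output on failure) can be simulated outside Flink. The sketch below is plain Python standing in for a Flink ProcessFunction with an error OutputTag; the field names are hypothetical:

```python
import json

def process(raw_events):
    """Route each event to the main output on success, or to an error
    side output on failure - mirroring Flink's side-output pattern."""
    main_output, error_output = [], []
    for raw in raw_events:
        try:
            event = json.loads(raw)
            event["event_type"]  # the real transformation step would go here
            main_output.append(event)
        except (ValueError, KeyError, TypeError):
            # In Flink this would be ctx.output(error_tag, raw); the sink
            # behind the tag writes to the S3 error folder.
            error_output.append(raw)
    return main_output, error_output
```

The point of the pattern is that bad records never poison the main stream; they stay inspectable in their own output.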


The Flink job graph looks like this:

[Image: Flink job graph]

There were two problems here that cost us time:

[Image: the two time-consuming problems]

After finishing this version, we realized that, although we had migrated the data transformation into Flink, we still had to run data deduplication and formatting as a separate process. We wanted to simplify the whole pipeline design by doing everything in Flink.

There were two possible ways to reach that goal:

  1. Use Redis or MySQL for deduplication, with Parquet as the final output format.
  2. Use Apache Hudi (a data lake platform) to provide record-level deduplication and updates.
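The first option can be sketched as keeping a store of already-seen record keys (Redis or MySQL in practice; a plain set here) and dropping repeats before writing the Parquet output:

```python
def deduplicate(events, seen):
    """Emit each event key at most once; `seen` stands in for Redis/MySQL.

    The key field name is hypothetical - any unique client-generated id works.
    """
    unique = []
    for event in events:
        key = event["event_id"]
        if key not in seen:
            seen.add(key)        # remember the key in the external store
            unique.append(event) # only first occurrences reach the output
    return unique
```

The operational cost of this option is the external store itself: it must be sized, persisted, and kept consistent with the output, which is exactly what Hudi's record-level index handles for you.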

We compared the two options and chose Hudi for these reasons:

  • Hudi keeps the record index in RocksDB and provides a way to reload it from existing data, which makes restarting or redeploying the application easy
  • Hudi manages ACID transactions, supports schema evolution, and has other features, such as Z-ordering, that may help future query performance

New data pipeline built with Hudi on Flink

[Image: the Hudi-on-Flink pipeline]

This is what the system looks like.

Migrating the previous version of the Flink application into this one wasn’t easy.

The big points we had to pay attention to:

[Image: migration points to watch out for]

For details, refer to https://github.com/apache/hudi/issues/4283.

There are two table types in Hudi:

  • MOR (Merge on Read): high write performance
    - Calculates the index from incoming data and decides whether to insert or update
    - Writes incoming data to Avro (row-based) files first
    - Creates a compaction plan
    - Compacts the data in the Avro files into Parquet files by creating a new version
  • COW (Copy on Write)
    - Calculates the index from incoming data and decides whether to insert or update
    - Updates existing Parquet files by creating a new version
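The two write paths can be contrasted with a toy model, where a "file" is just a dict keyed by record key (this is a conceptual sketch, not Hudi's actual storage layout):

```python
def cow_commit(base_file, batch):
    """Copy-on-write: every commit produces a new version of the Parquet file."""
    new_version = dict(base_file)                      # copy the whole file
    new_version.update({r["key"]: r for r in batch})   # insert or update by key
    return new_version

def mor_commit(log, batch):
    """Merge-on-read: a commit only appends row-based (Avro) log entries."""
    return log + batch

def mor_compact(base_file, log):
    """Compaction merges the accumulated log into a new Parquet file version."""
    return cow_commit(base_file, log)
```

This is why MOR writes are cheap (append-only) while MOR reads and compactions pay the merge cost, and why COW pays the full rewrite cost on every commit but keeps reads simple.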

We compared the two table types on our own data. Here is the config of the COW table:

[Image: COW table configuration]
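The exact values were in the original screenshot, but for illustration, a Hudi COW table is typically declared in Flink SQL along these lines (the path, schema, and field names here are hypothetical, and option names can vary slightly between Hudi versions):

```sql
CREATE TABLE game_events (
  event_id   STRING PRIMARY KEY NOT ENFORCED,  -- Hudi record key
  event_type STRING,
  event_time TIMESTAMP(3),
  dt         STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = 's3a://example-bucket/game_events',
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'upsert',          -- record-level dedup/update
  'precombine.field' = 'event_time'      -- newest event wins on key collision
);
```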

The Flink job graph looks like this:

[Image: Flink job graph for the COW table]

The MOR table config is:

[Image: MOR table configuration]
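Again for illustration only (the real values were in the screenshot), the MOR variant differs mainly in the table type and the compaction options of Hudi's Flink integration:

```sql
  'table.type' = 'MERGE_ON_READ',
  'compaction.async.enabled' = 'true',           -- compact outside the write path
  'compaction.trigger.strategy' = 'num_commits',
  'compaction.delta_commits' = '5'               -- compact every N delta commits
```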

The Flink job graph looks like this:

[Image: Flink job graph for the MOR table]

After testing both configs, we got the rough results below:

[Image: rough benchmark results for COW vs MOR]

We expected MOR to perform better thanks to asynchronous compaction, but the results went against our assumption. Reading the Hudi source code, we found the root cause: when Hudi creates a compaction plan, it lists all partition paths and filters them using the compaction settings. The more partitions you have, the longer this listing takes, regardless of whether compaction runs asynchronously or synchronously.

According to the Hudi docs, you can enable the metadata table, which maintains the list of files in storage and avoids the cost of listing all partitions. In our case, though, we chose the COW table because it showed the total time cost per checkpoint clearly.
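In Hudi's Flink integration, the metadata table is switched on with a single table option (assuming a reasonably recent Hudi version):

```sql
  'metadata.enabled' = 'true'   -- file listings served from the metadata table
```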

Although Hudi on Flink delivers data into S3 really fast, there are still some tricks to make it faster under high load:

[Image: tuning tricks for high load]

Performance details

We were using:

  • CPU: 4 cores
  • Memory: 12 GB
  • Kinesis shards: 4

[Image: weekly TPS and latency graph]

According to the weekly graph, the pipeline can handle 100–250 TPS with stable latency.

When we tested by consuming the backlog of remaining data, throughput came close to the Kinesis read capacity (4 shards × 1,000 records per second).
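As a quick sanity check on that ceiling, using the per-shard figure from above:

```python
shards = 4
records_per_shard_per_sec = 1000   # capacity figure used in this article
max_throughput = shards * records_per_shard_per_sec  # records/sec the job can read
```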

[Image: throughput while consuming the backlog]

When we looked deeper into the checkpoint duration details, we saw that processing time increased along with the size of the current file. Once a new file started being written, the processing time dropped rapidly.

[Image: checkpoint duration details]

Summary

Using Hudi on Flink, we were able to consolidate what had been separate data processing steps into one job. According to the performance testing results, the application met our speed requirements. Unlike Hudi on Spark, Hudi on Flink keeps the record index in Flink's state storage, which saves a lot of index-loading time; this is a double-edged sword, though, depending on your business needs. In short, the sensible use of Hudi on Flink solved our technical problems and met our business needs in one shot.

_________

For more information about our available positions, have a look below.

We’re actively looking for talent in the following roles.

Front-end Engineers

https://lnkd.in/dAbMnXm7

Backend Engineers: Golang + Python

https://lnkd.in/gg-hGe3v

https://lnkd.in/g-v_Yfwa

DevOps Engineers

https://lnkd.in/grdpA4TQ

NLP Research Scientist

https://lnkd.in/ghMCPxVg

Data Engineer — Scala

https://lnkd.in/ghxFu4G8

Product Manager

https://lnkd.in/gC_fk9WB
