CTW's near real-time data pipeline solution

*Our data wizard York Yu explains a solution he and the team worked on earlier in the year.*

The existing data pipeline

As a game platform, g123 ingests many different kinds of event data from our games.

Previously, we used Kinesis Data Firehose and Lambda to transform the events and dump them into S3. Since duplicate events could be produced on the client side, we also needed a deduplication step and a formatting layer for the raw data. The workflow looked like this:

[Image: the original Firehose + Lambda pipeline]

In this solution, we relied on AWS-managed services. The only code we had to maintain was in the Lambda functions, which let us deliver the pipeline quickly.
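A Firehose transformation Lambda of this kind follows AWS's record-batch contract: decode each base64 record, transform it, and return it marked `Ok` or `ProcessingFailed`. A minimal sketch, with hypothetical payload field names:

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: decode, reshape, and re-encode each record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical formatting step: normalize the event envelope.
            formatted = {
                "event_type": payload["type"],
                "event_time": payload["ts"],
                "payload": payload.get("body", {}),
            }
            data = base64.b64encode(json.dumps(formatted).encode()).decode()
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (KeyError, ValueError):
            # Malformed events are marked as failed so Firehose can park them.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
    return {"records": output}
```

Note that the transformation itself stays synchronous inside the Lambda; it is the pipeline around it that delivers results asynchronously.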

Although this solution worked well enough in 2021, it left us with two problems:

  1. Data transformation was asynchronous, so we couldn't guarantee that a given event had finished processing by the time we queried the data, unless we added yet another Kinesis stream after the Lambda function
  2. Kinesis Data Firehose can partition data dynamically, but whenever we needed more complex partitioning logic we had to create more Lambda functions, eating into resource quotas needed by other services

To solve these problems, we made a small change to the existing solution: we replaced Firehose and the Lambda functions with Flink.

New data pipeline built with Flink

[Image: the Flink-based pipeline]

For stability and manageability, we designed the system as follows:

  • Run the Flink cluster on Kubernetes (EKS) in application mode
  • Enable HA and external state storage for fault tolerance
  • Use the Datadog metrics reporter to expose job metrics
  • Transform each record before storing it in S3
  • On a processing failure, redirect the event to an error output folder using Flink's side-output feature
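The last two bullets (transform, then side-output on failure) can be simulated outside Flink. The sketch below is plain Python standing in for a Flink ProcessFunction with an error OutputTag; the field names are hypothetical:

```python
import json

def process(raw_events):
    """Route each event to the main output on success, or to an error
    side output on failure - mirroring Flink's side-output pattern."""
    main_output, error_output = [], []
    for raw in raw_events:
        try:
            event = json.loads(raw)
            event["event_type"]  # the real transformation step would go here
            main_output.append(event)
        except (ValueError, KeyError, TypeError):
            # In Flink this would be ctx.output(error_tag, raw); the sink
            # behind the tag writes to the S3 error folder.
            error_output.append(raw)
    return main_output, error_output
```

The point of the pattern is that bad records never poison the main stream; they stay inspectable in their own output.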


The Flink job graph looks like this:

[Image: Flink job graph]

There were two problems here that cost us time:

[Image: the two time-consuming problems]

After finishing this version, we realized that, although we had migrated the data transformation into Flink, we still had to run data deduplication and formatting as a separate process. We wanted to simplify the whole pipeline design by doing everything in Flink.

There were two possible ways to reach that goal:

  1. Use Redis or MySQL for deduplication, with Parquet as the final output format.
  2. Use Apache Hudi (a data lake platform) to provide record-level deduplication and updates.
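The first option can be sketched as keeping a store of already-seen record keys (Redis or MySQL in practice; a plain set here) and dropping repeats before writing the Parquet output:

```python
def deduplicate(events, seen):
    """Emit each event key at most once; `seen` stands in for Redis/MySQL.

    The key field name is hypothetical - any unique client-generated id works.
    """
    unique = []
    for event in events:
        key = event["event_id"]
        if key not in seen:
            seen.add(key)        # remember the key in the external store
            unique.append(event) # only first occurrences reach the output
    return unique
```

The operational cost of this option is the external store itself: it must be sized, persisted, and kept consistent with the output, which is exactly what Hudi's record-level index handles for you.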

We compared the two options and chose Hudi for these reasons:

  • Hudi keeps the record index in RocksDB and provides a way to reload it from existing data, which makes restarting or redeploying the application easy
  • Hudi manages ACID transactions, supports schema evolution, and has other features, such as Z-ordering, that may help future query performance

New data pipeline built with Hudi on Flink

[Image: the Hudi-on-Flink pipeline]

This is what the system looks like.

Migrating the previous version of the Flink application into this one wasn’t easy.

The big points we had to pay attention to:

[Image: migration points to watch out for]

For details, refer to https://github.com/apache/hudi/issues/4283.

There are two table types in Hudi:

  • MOR (Merge on Read): high write performance
    - Calculates the index from incoming data and decides whether to insert or update
    - Writes incoming data to Avro (row-based) files first
    - Creates a compaction plan
    - Compacts the data in the Avro files into Parquet files by creating a new version
  • COW (Copy on Write)
    - Calculates the index from incoming data and decides whether to insert or update
    - Updates existing Parquet files by creating a new version
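The two write paths can be contrasted with a toy model, where a "file" is just a dict keyed by record key (this is a conceptual sketch, not Hudi's actual storage layout):

```python
def cow_commit(base_file, batch):
    """Copy-on-write: every commit produces a new version of the Parquet file."""
    new_version = dict(base_file)                      # copy the whole file
    new_version.update({r["key"]: r for r in batch})   # insert or update by key
    return new_version

def mor_commit(log, batch):
    """Merge-on-read: a commit only appends row-based (Avro) log entries."""
    return log + batch

def mor_compact(base_file, log):
    """Compaction merges the accumulated log into a new Parquet file version."""
    return cow_commit(base_file, log)
```

This is why MOR writes are cheap (append-only) while MOR reads and compactions pay the merge cost, and why COW pays the full rewrite cost on every commit but keeps reads simple.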

We compared the two table types on our own data. Here is the config of the COW table:

[Image: COW table configuration]
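The exact values were in the original screenshot, but for illustration, a Hudi COW table is typically declared in Flink SQL along these lines (the path, schema, and field names here are hypothetical, and option names can vary slightly between Hudi versions):

```sql
CREATE TABLE game_events (
  event_id   STRING PRIMARY KEY NOT ENFORCED,  -- Hudi record key
  event_type STRING,
  event_time TIMESTAMP(3),
  dt         STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = 's3a://example-bucket/game_events',
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'upsert',          -- record-level dedup/update
  'precombine.field' = 'event_time'      -- newest event wins on key collision
);
```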

The Flink job graph looks like this:

[Image: Flink job graph for the COW table]

The MOR table config is:

[Image: MOR table configuration]
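Again for illustration only (the real values were in the screenshot), the MOR variant differs mainly in the table type and the compaction options of Hudi's Flink integration:

```sql
  'table.type' = 'MERGE_ON_READ',
  'compaction.async.enabled' = 'true',           -- compact outside the write path
  'compaction.trigger.strategy' = 'num_commits',
  'compaction.delta_commits' = '5'               -- compact every N delta commits
```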

The Flink job graph looks like this:

[Image: Flink job graph for the MOR table]

After testing both configs, we got the rough results below:

[Image: rough benchmark results for COW vs MOR]

We expected MOR to perform better thanks to asynchronous compaction, but the results went against our assumption. Reading the Hudi source code, we found the root cause: when Hudi creates a compaction plan, it lists all partition paths and filters them using the compaction settings. The more partitions you have, the longer this listing takes, regardless of whether compaction runs asynchronously or synchronously.

According to the Hudi docs, you can enable the metadata table, which maintains the list of files in storage and avoids the cost of listing all partitions. In our case, though, we chose the COW table because it showed the total time cost per checkpoint clearly.
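In Hudi's Flink integration, the metadata table is switched on with a single table option (assuming a reasonably recent Hudi version):

```sql
  'metadata.enabled' = 'true'   -- file listings served from the metadata table
```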

Although Hudi on Flink delivers data into S3 really fast, there are still some tricks to make it faster under high load:

[Image: tuning tricks for high load]

Performance details

We were using:

  • CPU: 4 cores
  • Memory: 12 GB
  • Kinesis shards: 4

[Image: weekly TPS and latency graph]

According to the weekly graph, the pipeline can handle 100–250 TPS with stable latency.

When we tested by consuming the backlog of remaining data, throughput came close to the Kinesis read capacity (4 shards × 1,000 records per second).
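As a quick sanity check on that ceiling, using the per-shard figure from above:

```python
shards = 4
records_per_shard_per_sec = 1000   # capacity figure used in this article
max_throughput = shards * records_per_shard_per_sec  # records/sec the job can read
```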

[Image: throughput while consuming the backlog]

When we looked deeper into the checkpoint duration details, we saw that processing time increased along with the size of the current file. Once a new file started being written, the processing time dropped rapidly.

[Image: checkpoint duration details]

Summary

Using Hudi on Flink, we were able to consolidate what had been separate data processing steps into one job. According to the performance testing results, the application met our speed requirements. Unlike Hudi on Spark, Hudi on Flink keeps the record index in Flink's state storage, which saves a lot of index-loading time; this is a double-edged sword, though, depending on your business needs. In short, the sensible use of Hudi on Flink solved our technical problems and met our business needs in one shot.

_________

For more information about our available positions, have a look below.

We’re actively looking for talent in the following roles.

Front-end Engineers

https://lnkd.in/dAbMnXm7

Backend Engineers: Golang + Python

https://lnkd.in/gg-hGe3v

https://lnkd.in/g-v_Yfwa

DevOps Engineers

https://lnkd.in/grdpA4TQ

NLP Research Scientist

https://lnkd.in/ghMCPxVg

Data Engineer — Scala

https://lnkd.in/ghxFu4G8

Product Manager

https://lnkd.in/gC_fk9WB
