Backfilling Apache Hudi Tables in Production: Techniques & Approaches Using AWS Glue by JobTarget LLC


Author:

Soumil Shah

(Lead Data Engineer)

Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. Python expert. YouTube educator in Data Science, Machine Learning, Elasticsearch, and AWS. Data Collection and Processing Team Lead at JobTarget, with extensive experience developing scalable software applications, working with massive data, creating data lakes, optimizing queries, and building streaming applications.

Website: soumilshah.com


Divyansh Patel

(Data Engineer)

Technophile with a passion for coding. Software developer at JobTarget with an engineering background, now focused on cloud and data science. Versatile in Python, C, C++, SQL, and AWS, and eager to learn more. Open to connecting with fellow software engineers.

Website: divyanshpatel.com


Introduction

Managing data in modern distributed systems presents various challenges, including handling real-time data streams and maintaining data integrity. Apache Hudi, an influential open-source data management framework, provides ACID (Atomicity, Consistency, Isolation, Durability) guarantees for big data workloads, making it well suited for large-scale data ingestion and backfilling operations. In this article, we walk through how to backfill Apache Hudi tables effectively in a production environment using AWS Glue, ensuring data consistency and reliability within our medallion architecture.

Medallion Architecture Overview

In our medallion architecture, user clicks are ingested into Kinesis streams and then stored in Amazon S3 in the Raw Zone. Data is organized into partitions based on the year, month, and day of ingestion. The next step uses AWS Glue, a fully managed extract, transform, and load (ETL) service, to read data from the Raw Zone and transform it into the Silver Zone.
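To make the layout concrete, here is a minimal sketch of how a Glue (PySpark) job reads such a Raw Zone; the bucket name, prefix, and JSON event format are illustrative assumptions, not our production values:

# Raw Zone layout (illustrative paths, not production values):
#   s3://my-raw-zone/clicks/year=2023/month=06/day=01/part-00000.json
#   s3://my-raw-zone/clicks/year=2023/month=06/day=02/part-00000.json
# With Hive-style paths, Spark discovers year/month/day as partition columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-raw-zone").getOrCreate()
raw_df = spark.read.format("json").load("s3://my-raw-zone/clicks/")
raw_df.printSchema()  # schema includes the year, month, and day partition columns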


Approach to Backfilling with Apache Hudi and AWS Glue

When we needed to introduce a new column, "new_col," into our Clicks stream, we wanted to ensure that the Silver Zone also had this new data without introducing any duplicates. To achieve this, we devised the following approach:

Step 1: Identifying the Affected Partition

The first step was to identify the specific partition (month) in which the new field, "new_col," first appeared in the Clicks stream. This allowed us to target only the data that required backfilling.
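Continuing the sketch above (same assumed names), the affected partition can be found by looking for the earliest year/month in which "new_col" is populated:

from pyspark.sql import functions as F

# The earliest partition containing a non-null "new_col" is where the
# backfill must start; year/month as partition columns are an assumption.
first_seen = (raw_df.filter(F.col("new_col").isNotNull())
              .groupBy("year", "month")
              .count()
              .orderBy("year", "month")
              .limit(1))
first_seen.show()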

Step 2: Temporarily Turning Off the Regular Jobs


In the job configuration, Active is set to False, which turns off the regular job.
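How the flag is stored depends on the ingestion framework's job registry; as a purely hypothetical illustration (the field names below are invented, not the framework's actual schema), the configuration entry might look like this:

# Hypothetical job-registry entry; field names are illustrative only.
job_config = {
    "job_name": "clicks-raw-to-silver",  # assumed job name
    "active": False,                     # False pauses the scheduled run
    "write_operation": "bulk_insert",    # the regular job's default mode
}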


Step 3: Creating a Backfill Job for the Partition Using the Glue Ingestion Framework

We created the backfill job using our Glue Ingestion Framework, which helps us create and deploy jobs faster.




Link to the Glue Ingestion Framework:

https://www.dhirubhai.net/pulse/lakeboostmaximizing-efficiency-data-lake-hudi-glue-etl-soumil-shah%3FtrackingId=bJJzLs%252BATSKdy32wUkl1OQ%253D%253D/?trackingId=bJJzLs%2BATSKdy32wUkl1OQ%3D%3D





Utilizing AWS Glue's ingestion framework, we created a backfilling job. This job read the data from the identified partition in the Raw Zone and performed an "UPSERT" operation into the Silver Zone. This process updated existing records and inserted new ones, ensuring that the "new_col" was added without introducing duplicates.
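A minimal sketch of such a backfill write with the Hudi Spark datasource follows; the table name, record key, precombine field, partition filter, and S3 paths are all assumptions, and in a Glue job the SparkSession would come from the GlueContext:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clicks-backfill").getOrCreate()

# Read only the affected partition from the Raw Zone (names assumed).
backfill_df = (spark.read.format("json")
               .load("s3://my-raw-zone/clicks/")
               .filter("year = 2023 AND month = 6"))

hudi_options = {
    "hoodie.table.name": "clicks_silver",                    # assumed table name
    "hoodie.datasource.write.recordkey.field": "click_id",   # assumed record key
    "hoodie.datasource.write.precombine.field": "event_ts",  # assumed ordering field
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    # ComplexKeyGenerator handles the multi-field partition path above.
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.operation": "upsert",           # update existing keys, insert new ones
}

(backfill_df.write.format("hudi")
 .options(**hudi_options)
 .mode("append")
 .save("s3://my-silver-zone/clicks/"))                       # assumed Silver Zone path

Because "upsert" deduplicates on the record key, re-processing the partition updates rows that already exist in the Silver Zone rather than appending copies of them.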


Step 4: Re-Syncing the Timeline for Regular Jobs

Once the backfilling job completed successfully and the bookmark and timeline matched, we ran the regular Glue job again, this time in "UPSERT" mode. This step filled in any missing gaps and synchronized the timeline between the Raw and Silver Zones.
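In code terms, the only difference from the regular job is the write operation; continuing the sketch above:

# Same job and options as the regular run; only the write operation is
# switched to "upsert" for the one-off catch-up pass.
hudi_options["hoodie.datasource.write.operation"] = "upsert"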


Step 5: Reverting to Bulk Insert Mode

Finally, after the backfill was complete and the timeline was in sync, we reverted the regular Glue job to its original "BULK INSERT" mode, so that it continued processing incremental updates as usual.
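Continuing the sketch, the revert is again a one-line configuration change:

# Restore the regular job's original, faster write path; note that
# "bulk_insert" skips the key-based deduplication that "upsert" performs,
# which is why the timelines must be in sync before switching back.
hudi_options["hoodie.datasource.write.operation"] = "bulk_insert"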




Complete Flow

(Figure: the complete end-to-end backfill flow.)


Advantages and Benefits of the Approach

  1. Data Consistency: By utilizing Apache Hudi's ACID features, we ensured that the backfilling process maintained data consistency across the Silver Zone without introducing duplicates or data integrity issues.
  2. Efficient Backfilling: The approach allowed us to selectively backfill only the affected partition, minimizing unnecessary processing and optimizing resource utilization.
  3. Timely Updates: The regular Glue job continued to process incremental updates, which meant that newly ingested data was quickly propagated to the Silver Zone once the backfill was completed.
  4. Minimal Reruns: Our approach significantly reduced the need for massive reruns or data scans, making the backfilling process more efficient and less resource-intensive.


Conclusion

Backfilling Apache Hudi tables in production with AWS Glue can be done efficiently and reliably by following a well-structured approach. By leveraging Apache Hudi's ACID guarantees alongside AWS Glue's ETL capabilities, we introduced new data into the Silver Zone without compromising data integrity or incurring unnecessary overhead. Together, these technologies let us manage data at scale while ensuring timely updates and minimizing disruption to our production systems.
