Backfilling Apache Hudi Tables in Production: Techniques & Approaches Using AWS Glue by Job Target LLC
Authors:
Soumil Shah
(Lead Data Engineer)
Holds a Bachelor of Science in Electronic Engineering and a dual master's in Electrical and Computer Engineering. Python expert. YouTube educator in data science, machine learning, Elasticsearch, and AWS. Data Collection and Processing Team Lead at JobTarget, with extensive experience developing scalable software applications, working with massive datasets, building data lakes, optimizing queries, and developing streaming applications.
Website: soumilshah.com
Divyansh Patel
(Data Engineer)
Technophile with a passion for coding. Software developer at JobTarget. Engineering background, now into Cloud and data science. Versatile in Python, C, C++, SQL, AWS, and eager to learn more. Open to connecting with fellow software engineers.
Website: divyanshpatel.com
Introduction
Managing data in modern distributed systems presents various challenges, including handling real-time data streams and maintaining data integrity. Apache Hudi, an influential open-source data management framework, provides ACID (Atomicity, Consistency, Isolation, Durability) guarantees for big data workloads, making it well suited for large-scale data ingestion and backfilling operations. In this article, we walk through how we backfill Apache Hudi tables in a production environment using AWS Glue while keeping data consistent and reliable within our medallion architecture.
Medallion Architecture Overview
In our medallion architecture, user click events are ingested through Amazon Kinesis streams and stored in the Raw Zone in Amazon S3, partitioned by the year, month, and day of ingestion. AWS Glue, a fully managed extract, transform, and load (ETL) service, then reads the data from the Raw Zone and transforms it into the Silver Zone.
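To make the flow concrete, here is a minimal sketch of what such a Glue (PySpark) job might look like. The bucket paths, table name, and key fields are illustrative assumptions, not our exact production configuration.

# Minimal sketch of a Glue (PySpark) job that reads raw click events from the
# Raw Zone and writes them to a Hudi table in the Silver Zone.
# Bucket paths, table name, and key fields are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("clicks-raw-to-silver")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Raw Zone: JSON click events partitioned by year/month/day of ingestion.
raw_df = spark.read.json("s3://example-raw-zone/clicks/")

hudi_options = {
    "hoodie.table.name": "clicks_silver",                    # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "click_id",   # assumed record key
    "hoodie.datasource.write.precombine.field": "event_ts",  # assumed precombine field
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.operation": "bulk_insert",      # the regular job runs in BULK INSERT mode
}

(
    raw_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-silver-zone/clicks_silver/")
)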
Approach to Backfilling with Apache Hudi and AWS Glue
When we needed to introduce a new column, "new_col," into our Clicks stream, we wanted to ensure that the Silver Zone also had this new data without introducing any duplicates. To achieve this, we devised the following approach:
Step 1: Identifying the Affected Partition
The first step was to identify the specific partition (month) in which the new field, "new_col," first appeared in the Clicks stream. This allowed us to target only the data that required backfilling.
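One way to confirm exactly where the new field starts appearing is to check the schema Spark infers for each daily partition. The sketch below assumes JSON raw data under year/month/day prefixes and an affected month of July 2023, both of which are illustrative assumptions.

# Sketch: find which daily partitions of the Raw Zone contain "new_col" by
# inspecting the schema Spark infers for each day. Paths and the affected
# month are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("find-backfill-partition").getOrCreate()

def partition_has_column(path: str, column: str) -> bool:
    """Return True if the inferred schema for this partition contains `column`."""
    try:
        return column in spark.read.json(path).columns
    except Exception:
        # The partition may not exist (e.g. future dates); treat it as missing.
        return False

# Scan the candidate days in the affected month.
for day in range(1, 32):
    path = f"s3://example-raw-zone/clicks/year=2023/month=07/day={day:02d}/"
    print(path, "has new_col:", partition_has_column(path, "new_col"))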
Step 2: Temporarily Turning Off the Regular Jobs
As shown in the figure, Active is marked as False in the ingestion framework configuration, which turns the regular job off.
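In our setup this pause is handled by the ingestion framework's Active flag. If your regular job is scheduled with a Glue trigger instead, a comparable pause can be done with boto3, as in the sketch below (the trigger name is hypothetical).

# Sketch: pausing a scheduled Glue job by stopping its trigger with boto3.
# In our pipeline this is handled by setting Active=False in the ingestion
# framework configuration; the trigger name below is hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Deactivate the schedule so no new runs of the regular job are started.
glue.stop_trigger(Name="clicks-silver-regular-trigger")

# Later, once the backfill is done and the timeline is in sync:
# glue.start_trigger(Name="clicks-silver-regular-trigger")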
Step 3: Creating a Backfill Job for the Partition Using the Glue Ingestion Framework
As shown in the figure, we create a backfill job using our Glue Ingestion Framework, which helps us create and deploy jobs faster.
Link to Glue Ingestion Framework
Utilizing AWS Glue's ingestion framework, we created a backfilling job. This job read the data from the identified partition in the Raw Zone and performed an "UPSERT" operation into the Silver Zone. This process updated existing records and inserted new ones, ensuring that the "new_col" was added without introducing duplicates.
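Below is a hedged sketch of what the backfill write looks like: it reads only the affected month from the Raw Zone and upserts it into the existing Silver Zone table. Paths, table name, and key fields are the same illustrative assumptions as in the earlier sketch.

# Sketch of the backfill job: read only the affected month from the Raw Zone
# and UPSERT it into the existing Silver Zone Hudi table. Paths, table name,
# and key fields are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("clicks-backfill-2023-07")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Read only the partition (month) where "new_col" was introduced; basePath
# keeps the year/month/day partition columns in the DataFrame.
backfill_df = (
    spark.read
    .option("basePath", "s3://example-raw-zone/clicks/")
    .json("s3://example-raw-zone/clicks/year=2023/month=07/")
)

hudi_options = {
    "hoodie.table.name": "clicks_silver",
    "hoodie.datasource.write.recordkey.field": "click_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    # UPSERT updates records that already exist and inserts new ones, so the
    # added column lands in the Silver Zone without creating duplicates.
    "hoodie.datasource.write.operation": "upsert",
}

(
    backfill_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-silver-zone/clicks_silver/")
)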
Step 4: Re-Syncing the Timeline For Regular Jobs
Once the backfilling job had completed successfully and the bookmark and timeline matched, we ran the regular Glue job again, this time in "UPSERT" mode. This step filled in any remaining gaps and synchronized the timeline between the Raw and Silver Zones.
Step 5: Reverting to Bulk Insert Mode
Finally, after the backfill was complete and the timeline was in sync, we reverted the regular Glue job to its original "BULK INSERT" mode so that it could continue processing incremental updates as usual.
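One way to make the UPSERT/BULK INSERT switch operationally simple is to expose the Hudi write operation as a Glue job argument, so the same job can run the catch-up pass in "upsert" mode and then be flipped back to "bulk_insert". A minimal sketch, with an assumed argument name:

# Sketch: drive the Hudi write operation from a Glue job argument so the same
# regular job can run in "upsert" mode for the catch-up pass and then be
# reverted to "bulk_insert". The argument name WRITE_OPERATION is an assumption.
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "WRITE_OPERATION"])

write_operation = args["WRITE_OPERATION"]  # "upsert" or "bulk_insert"
assert write_operation in ("upsert", "bulk_insert")

hudi_options = {
    "hoodie.table.name": "clicks_silver",
    "hoodie.datasource.write.recordkey.field": "click_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.operation": write_operation,
}
# The rest of the job reads the Raw Zone and writes with these options,
# exactly as in the earlier sketches.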
Complete Flow
Advantages and Benefits of the Approach
Targeting only the affected partition kept the backfill small and fast. Running the backfill as an "UPSERT" updated existing records and added "new_col" without introducing duplicates. Pausing the regular job and re-syncing the timeline preserved consistency between the Raw and Silver Zones, and reverting to "BULK INSERT" kept the steady-state pipeline as efficient as before.
Conclusion
In conclusion, backfilling Apache Hudi tables in production with AWS Glue can be done efficiently and reliably by following a well-structured approach. By leveraging Apache Hudi's ACID guarantees and AWS Glue's ETL capabilities, we introduced new data into the Silver Zone without compromising data integrity or incurring unnecessary overhead. The combination of these technologies allowed us to manage data at scale while ensuring timely updates and minimizing disruptions to our production systems.