Backfilling Apache Hudi Tables in Production: Techniques & Approaches Using AWS Glue by Job Target LLC
Authors:
Soumil Shah
(Lead Data Engineer)
Holds a Bachelor of Science in Electronic Engineering and a dual master's in Electrical and Computer Engineering. Python expert. YouTube educator in data science, machine learning, Elasticsearch, and AWS. Data Collection and Processing Team Lead at JobTarget, with extensive experience developing scalable software applications, working with massive datasets, building data lakes, optimizing queries, and developing streaming applications.
Website: soumilshah.com
Divyansh Patel
(Data Engineer)
Technophile with a passion for coding. Software developer at JobTarget. Engineering background, now into Cloud and data science. Versatile in Python, C, C++, SQL, AWS, and eager to learn more. Open to connecting with fellow software engineers.
Website: divyanshpatel.com
Introduction
Managing data in modern distributed systems presents various challenges, including handling real-time data streams and maintaining data integrity. Apache Hudi, an influential open-source data management framework, provides ACID (Atomicity, Consistency, Isolation, Durability) guarantees for big data workloads, making it well suited for large-scale data ingestion and backfilling operations. In this article, we walk through how we backfill Apache Hudi tables in a production environment using AWS Glue while keeping data consistent and reliable within our medallion architecture.
Medallion Architecture Overview
In our medallion architecture, user click events are ingested through Amazon Kinesis streams and stored in the Raw Zone in Amazon S3, partitioned by the year, month, and day of ingestion. AWS Glue, a fully managed extract, transform, and load (ETL) service, then reads the data from the Raw Zone and transforms it into the Silver Zone.
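To make the flow concrete, here is a minimal sketch of what such a Glue (PySpark) job might look like. The bucket paths, table name, and key fields are illustrative assumptions, not our exact production configuration.

# Minimal sketch of a Glue (PySpark) job that reads raw click events from the
# Raw Zone and writes them to a Hudi table in the Silver Zone.
# Bucket paths, table name, and key fields are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("clicks-raw-to-silver")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Raw Zone: JSON click events partitioned by year/month/day of ingestion.
raw_df = spark.read.json("s3://example-raw-zone/clicks/")

hudi_options = {
    "hoodie.table.name": "clicks_silver",                    # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "click_id",   # assumed record key
    "hoodie.datasource.write.precombine.field": "event_ts",  # assumed precombine field
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.operation": "bulk_insert",      # the regular job runs in BULK INSERT mode
}

(
    raw_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-silver-zone/clicks_silver/")
)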
Approach to Backfilling with Apache Hudi and AWS Glue
When we needed to introduce a new column, "new_col," into our Clicks stream, we wanted to ensure that the Silver Zone also had this new data without introducing any duplicates. To achieve this, we devised the following approach:
Step 1: Identifying the Affected Partition
The first step was to identify the specific partition (month) in which the new field, "new_col," first appeared in the Clicks stream. This allowed us to target only the data that required backfilling.
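One way to confirm exactly where the new field starts appearing is to check the schema Spark infers for each daily partition. The sketch below assumes JSON raw data under year/month/day prefixes and an affected month of July 2023, both of which are illustrative assumptions.

# Sketch: find which daily partitions of the Raw Zone contain "new_col" by
# inspecting the schema Spark infers for each day. Paths and the affected
# month are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("find-backfill-partition").getOrCreate()

def partition_has_column(path: str, column: str) -> bool:
    """Return True if the inferred schema for this partition contains `column`."""
    try:
        return column in spark.read.json(path).columns
    except Exception:
        # The partition may not exist (e.g. future dates); treat it as missing.
        return False

# Scan the candidate days in the affected month.
for day in range(1, 32):
    path = f"s3://example-raw-zone/clicks/year=2023/month=07/day={day:02d}/"
    print(path, "has new_col:", partition_has_column(path, "new_col"))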
Step 2: Temporarily Turning Off the Regular Jobs
As shown in the figure, Active is marked as False in the ingestion framework configuration, which turns the regular job off.
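In our setup this pause is handled by the ingestion framework's Active flag. If your regular job is scheduled with a Glue trigger instead, a comparable pause can be done with boto3, as in the sketch below (the trigger name is hypothetical).

# Sketch: pausing a scheduled Glue job by stopping its trigger with boto3.
# In our pipeline this is handled by setting Active=False in the ingestion
# framework configuration; the trigger name below is hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Deactivate the schedule so no new runs of the regular job are started.
glue.stop_trigger(Name="clicks-silver-regular-trigger")

# Later, once the backfill is done and the timeline is in sync:
# glue.start_trigger(Name="clicks-silver-regular-trigger")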
Step 3: Creating a Backfill Job for the Partition Using the Glue Ingestion Framework
As shown in the figure, we create a backfill job using our Glue Ingestion Framework, which helps us create and deploy jobs faster.
Link to Glue Ingestion Framework
Utilizing AWS Glue's ingestion framework, we created a backfilling job. This job read the data from the identified partition in the Raw Zone and performed an "UPSERT" operation into the Silver Zone. This process updated existing records and inserted new ones, ensuring that the "new_col" was added without introducing duplicates.
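Below is a hedged sketch of what the backfill write looks like: it reads only the affected month from the Raw Zone and upserts it into the existing Silver Zone table. Paths, table name, and key fields are the same illustrative assumptions as in the earlier sketch.

# Sketch of the backfill job: read only the affected month from the Raw Zone
# and UPSERT it into the existing Silver Zone Hudi table. Paths, table name,
# and key fields are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("clicks-backfill-2023-07")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Read only the partition (month) where "new_col" was introduced; basePath
# keeps the year/month/day partition columns in the DataFrame.
backfill_df = (
    spark.read
    .option("basePath", "s3://example-raw-zone/clicks/")
    .json("s3://example-raw-zone/clicks/year=2023/month=07/")
)

hudi_options = {
    "hoodie.table.name": "clicks_silver",
    "hoodie.datasource.write.recordkey.field": "click_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    # UPSERT updates records that already exist and inserts new ones, so the
    # added column lands in the Silver Zone without creating duplicates.
    "hoodie.datasource.write.operation": "upsert",
}

(
    backfill_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-silver-zone/clicks_silver/")
)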
Step 4: Re-Syncing the Timeline For Regular Jobs
Once the backfilling job had completed successfully and the bookmark and timeline matched, we ran the regular Glue job again, this time in "UPSERT" mode. This step filled in any remaining gaps and synchronized the timeline between the Raw and Silver Zones.
Step 5: Reverting to Bulk Insert Mode
Finally, after the backfill was complete and the timeline was in sync, we reverted the regular Glue job to its original "BULK INSERT" mode so that it could continue processing incremental updates as usual.
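One way to make the UPSERT/BULK INSERT switch operationally simple is to expose the Hudi write operation as a Glue job argument, so the same job can run the catch-up pass in "upsert" mode and then be flipped back to "bulk_insert". A minimal sketch, with an assumed argument name:

# Sketch: drive the Hudi write operation from a Glue job argument so the same
# regular job can run in "upsert" mode for the catch-up pass and then be
# reverted to "bulk_insert". The argument name WRITE_OPERATION is an assumption.
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "WRITE_OPERATION"])

write_operation = args["WRITE_OPERATION"]  # "upsert" or "bulk_insert"
assert write_operation in ("upsert", "bulk_insert")

hudi_options = {
    "hoodie.table.name": "clicks_silver",
    "hoodie.datasource.write.recordkey.field": "click_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.operation": write_operation,
}
# The rest of the job reads the Raw Zone and writes with these options,
# exactly as in the earlier sketches.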
Complete Flow
Advantages and Benefits of the Approach
Targeting only the affected partition kept the backfill small and fast. Running the backfill as an "UPSERT" updated existing records and added "new_col" without introducing duplicates. Pausing the regular job and re-syncing the timeline preserved consistency between the Raw and Silver Zones, and reverting to "BULK INSERT" kept the steady-state pipeline as efficient as before.
Conclusion
In conclusion, backfilling Apache Hudi tables in production with AWS Glue can be done efficiently and reliably by following a well-structured approach. By leveraging Apache Hudi's ACID guarantees and AWS Glue's ETL capabilities, we introduced new data into the Silver Zone without compromising data integrity or incurring unnecessary overhead. The combination of these technologies allowed us to manage data at scale while ensuring timely updates and minimizing disruptions to our production systems.