Accelerating Data Processing: Leveraging Apache Hudi with DynamoDB for Faster Commit Time Retrieval with Source Code

Data processing at scale is a constant challenge in the world of big data. As organizations strive to analyze and extract insights from ever-growing data sets, the need for efficient data processing solutions becomes paramount. Apache Hudi, an open-source data management framework, provides the flexibility and performance required for large-scale data operations. In this blog, we'll explore a powerful solution that leverages Apache Hudi and DynamoDB to accelerate data processing by significantly improving commit time retrieval.

Video-Based Guide


The Challenge: Faster, Better, Easier Data Processing

One of the key challenges in big data processing is efficiently managing and retrieving commit times. Commit times are crucial for tasks like point-in-time queries, incremental queries, and batch processing pipelines. Traditional approaches to commit time management often involve complex setups and infrastructure maintenance. This can result in slow processing times, increased operational overhead, and scalability challenges. As the size of the data and the number of commits increase, fetching commits and scanning files can become expensive.

The Solution: Leveraging Apache Hudi with DynamoDB


Our solution combines Apache Hudi's commit callback capabilities with DynamoDB, a highly scalable, serverless NoSQL database service from AWS. Here's how it works:

  1. Hudi HTTP Callback URL: To initiate the process, we use Hudi's HTTP callback feature. Whenever a commit completes, Hudi sends an event to a Lambda function via the callback URL.
  2. Lambda Function for Data Storage: The Lambda function is responsible for receiving the Hudi event and storing the data in DynamoDB. This Lambda function takes the event data and organizes it efficiently within DynamoDB.
  3. Efficient Data Storage: The Lambda function extracts the essential information from the event, such as tableName, commitTime, and basePath. It also calculates the ingestion timestamp and various date components like year, month, day, hour, and minute.
  4. Storing Data in DynamoDB: Using these extracted values, the Lambda function creates a DynamoDB item as a dictionary and inserts it into the DynamoDB table. This approach ensures efficient and organized data storage for future retrieval.
  5. Serverless Scalability: The serverless nature of AWS Lambda ensures that the system can scale up or down automatically based on the number of incoming events. DynamoDB, being a managed NoSQL database, also scales efficiently based on the event load.

The result is a streamlined and serverless solution that efficiently stores commit time data for any given table. This data can be retrieved for point-in-time queries, incremental queries, or as part of batch processing pipelines.
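To make steps 2 through 4 concrete, here is a minimal sketch of what such a Lambda handler could look like. The table name hudi_commits, the payload field names (tableName, commitTime, basePath), and the Function URL event shape are assumptions based on Hudi's HTTP commit callback; verify them against the payload your Hudi version actually sends.

```python
# Hypothetical Lambda handler for Hudi's HTTP commit callback.
# Payload field names (tableName, commitTime, basePath) are assumptions;
# check them against the callback body your Hudi version sends.
import json
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("hudi_commits")  # table created in Step 1


def lambda_handler(event, context):
    # With a Lambda Function URL, the callback payload arrives in event["body"]
    body = json.loads(event.get("body") or "{}")

    now = datetime.now(timezone.utc)
    item = {
        "TableName": body.get("tableName", "unknown"),
        "CommitTime": body.get("commitTime", ""),
        "BasePath": body.get("basePath", ""),
        "IngestionTime": now.isoformat(),
        "Year": now.year,
        "Month": now.month,
        "Day": now.day,
        "Hour": now.hour,
        "Minute": now.minute,
    }

    table.put_item(Item=item)
    return {"statusCode": 200, "body": json.dumps({"stored": item["CommitTime"]})}
```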


Steps for Labs

Step 1: Create DynamoDB Table hudi_commits

The DynamoDB table structure is designed as follows:

  • Partition Key: TableName
  • Sort Key: CommitTime
  • Additional Fields: Year, Month, Day, Hour, Minute, and IngestionTime (or Event Time)

You can set up additional GSIs or LSIs if needed, based on your access patterns.
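If you prefer to create the table from code rather than the console, a minimal boto3 sketch could look like the following. On-demand billing is an assumption; pick the capacity mode that fits your workload.

```python
# Create the hudi_commits table with TableName as the partition key and
# CommitTime as the sort key (both strings). Billing mode is an assumption.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="hudi_commits",
    AttributeDefinitions=[
        {"AttributeName": "TableName", "AttributeType": "S"},
        {"AttributeName": "CommitTime", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "TableName", "KeyType": "HASH"},
        {"AttributeName": "CommitTime", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```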


Step 2: AWS Lambda Functions

Let's enable a Function URL on the Lambda function.

Click Save and copy the Function URL.

Configure an IAM policy for the Lambda function's execution role so it has permission to insert data into DynamoDB.
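As a sketch, the execution role only needs permission to write items into the hudi_commits table. The role name, region, and account ID below are placeholders; substitute your own values, or attach an equivalent inline policy from the console.

```python
# Attach an inline policy to the Lambda execution role allowing PutItem on
# the hudi_commits table. Role name, region, and account ID are placeholders.
import json

import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:PutItem"],
            "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/hudi_commits",
        }
    ],
}

iam.put_role_policy(
    RoleName="hudi-commit-callback-lambda-role",  # hypothetical role name
    PolicyName="AllowPutItemHudiCommits",
    PolicyDocument=json.dumps(policy),
)
```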

After this, we need to add three configurations on the Hudi side.

Hudi Configuration

To enable Hudi to utilize this solution, you need to configure the following Hudi settings:

  • hoodie.write.commit.callback.http.url: Set this to the Lambda Function URL you copied earlier.
  • hoodie.write.commit.callback.on: Set this to true to enable the commit callback feature.
  • hoodie.write.commit.callback.http.timeout.seconds: Set a reasonable timeout (in seconds) for the HTTP callback.

By configuring Hudi with these settings, you ensure that it communicates effectively with the Lambda function for commit time storage.
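For example, in a PySpark writer the three settings can be passed alongside your usual Hudi options. The DataFrame, record key and precombine fields, table name, base path, and Function URL below are placeholders, and the timeout value is just an illustrative choice.

```python
# Minimal sketch of a Hudi write with the HTTP commit callback enabled.
# df is an existing Spark DataFrame; the table name, base path, and the
# Function URL are placeholders -- replace them with your own values.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    # Commit callback settings
    "hoodie.write.commit.callback.on": "true",
    "hoodie.write.commit.callback.http.url": "https://<your-function-url>.lambda-url.us-east-1.on.aws/",
    "hoodie.write.commit.callback.http.timeout.seconds": "10",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/hudi/orders/")
)
```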


Write data into Hudi, and you will see the Lambda function scale out, process the callback events, and insert them into DynamoDB.

Dealing with Increasing Data Size

As data size increases, fetching commits and scanning files can become expensive operations. This is where the combination of Apache Hudi and DynamoDB shines. DynamoDB's scalability and performance capabilities make it well-suited to handle growing data volumes. DynamoDB can efficiently handle the increasing load, ensuring that fetching commits remains fast and cost-effective.

Furthermore, Apache Hudi's data management capabilities, including indexing and compaction, help in optimizing data access. This means that as data grows, you won't experience a linear increase in retrieval costs. Instead, with efficient data management, you can maintain reasonable retrieval times and cost-effectiveness even with large datasets.


Simplifying Downstream Data Processing

With commit times stored efficiently in DynamoDB, downstream applications, batch jobs, and incremental ETL processes can access commits at lightning speed. Whether you need to perform incremental updates or back-filling jobs, DynamoDB provides a central repository for all your commit time data. This approach allows you to focus on the active commits in the Hudi timeline and archive older commits, enhancing query performance.
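As a sketch of that downstream access pattern, an incremental job could check DynamoDB for commits newer than its last checkpoint and, if any exist, run a Hudi incremental read from that checkpoint. The checkpoint value, table name, and paths below are hypothetical.

```python
# Query DynamoDB for commits on a given table newer than the last checkpoint,
# then run a Hudi incremental read starting from that checkpoint.
# The checkpoint, table name, and base path are hypothetical; spark is an
# existing SparkSession configured with the Hudi bundle.
import boto3
from boto3.dynamodb.conditions import Key

commits_table = boto3.resource("dynamodb").Table("hudi_commits")
last_processed = "20240101000000000"  # checkpoint saved by the previous run

resp = commits_table.query(
    KeyConditionExpression=Key("TableName").eq("orders")
    & Key("CommitTime").gt(last_processed)
)
new_commits = sorted(item["CommitTime"] for item in resp["Items"])

if new_commits:
    incremental_df = (
        spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", last_processed)
        .load("s3://my-bucket/hudi/orders/")
    )
```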



Complete Code

https://github.com/soumilshah1995/-Accelerating-Data-Processing-Leveraging-Apache-Hudi-with-DynamoDB-for-Faster-Commit-Time-Retrieval/tree/main


Conclusion

Efficient data processing and commit time retrieval are critical components of big data operations. The combination of Apache Hudi and DynamoDB offers an elegant solution to address these challenges. By leveraging serverless infrastructure and automation, this approach simplifies the process, accelerates data processing, and reduces operational overhead. It enables organizations to perform point-in-time queries, incremental queries, and batch processing pipelines with ease. As data volumes continue to grow, solutions like this become invaluable for managing and extracting insights from large datasets.

Kyle Weller

VP of Product @ Onehouse.ai | ex Azure Databricks

1y

Nice example of reverse etl Soumil S.. Not many people show good examples for this so kudos! This could be super useful to serve data to downstream real time business applications
