Sync Tables in All Three Formats (Hudi | Delta | Iceberg) with XTable and AWS Lambda: Automate, Schedule, or Trigger On-Demand

Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Apache Hudi Specialist | Spark & AWS Glue| Data Lake Specialist | YouTuber

Published: November 22, 2024

Effortlessly manage table syncing in multiple formats (Hudi, Delta, Iceberg) with this innovative AWS architecture. Designed for flexibility and scalability, this solution leverages Apache XTable, AWS Lambda, and API Gateway to give you control over how and when your tables are synced. Let’s dive into the details of this architecture and explore how it works.

Video Guides

Demo on AWS Lambda

Overview of the Architecture

This setup syncs tables across three formats: Hudi, Delta, and Iceberg. It supports:

Scheduled Syncs using CRON jobs.
Manual Syncs triggered via an API Gateway.
Process-Driven Triggers for real-time flexibility.

How It Works

CRON Configuration:

A CRON job is set up to point to a config.yaml file stored in an S3 bucket.
The CRON job triggers an AWS Lambda function, which reads the configuration and executes the Apache XTable sync command.
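
To make the scheduled flow concrete, here is a minimal sketch of what the config file and the event handling might look like. The YAML follows the dataset config layout documented by Apache XTable, but the bucket, path, and table name are placeholders. The event field name config_s3_uri and both helper functions are hypothetical and are not taken from the lab repository.

```python
from urllib.parse import urlparse

# Example dataset config in the layout documented by Apache XTable.
# Bucket, path, and table name are placeholders.
EXAMPLE_CONFIG_YAML = """\
sourceFormat: HUDI
targetFormats:
  - DELTA
  - ICEBERG
datasets:
  - tableBasePath: s3://my-bucket/warehouse/orders
    tableName: orders
"""


def parse_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid S3 object URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def config_location_from_event(event):
    """Read the hypothetical 'config_s3_uri' field sent by the schedule."""
    return parse_s3_uri(event["config_s3_uri"])
```

With this shape, the schedule only needs to pass a constant JSON input such as {"config_s3_uri": "s3://my-bucket/configs/config.yaml"}, and switching tables becomes a matter of editing the file in S3 rather than redeploying the function.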

Manual Sync:

Users or processes can initiate a manual sync by making a POST request to the API Gateway.
The API Gateway sends the request to a Lambda function, which runs the sync command with the specified configuration.
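
The manual path can be sketched from both sides. The first helper below builds the POST request a caller might send, and the second shows how a handler could accept the same payload whether it arrives through an API Gateway proxy integration (as a JSON string under "body") or directly from a schedule. The endpoint URL and the config_s3_uri field are assumptions for illustration, and base64-encoded bodies are not handled.

```python
import json
import urllib.request


def build_sync_request(api_url, config_s3_uri):
    """Build (but do not send) a POST request asking for a manual sync."""
    body = json.dumps({"config_s3_uri": config_s3_uri}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_payload(event):
    """Return the request payload for both invocation styles.

    An API Gateway proxy integration delivers the POST body as a JSON
    string under "body"; a scheduled or direct invocation delivers the
    payload itself.
    """
    body = event.get("body")
    if body is None:
        return event
    return json.loads(body) if isinstance(body, str) else body
```

Passing the built request to urllib.request.urlopen sends it. Normalizing the event in one place lets the scheduled and manual triggers share the rest of the handler.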

Serverless Scalability:

AWS Lambda scales out automatically with the number of concurrent sync requests, so there are no servers to provision or manage. Each invocation is still bound by Lambda's 15-minute maximum runtime.

Technical Details

Dockerized Lambda Function

We leverage Docker to bundle all necessary dependencies, Java libraries, and Python code into a single, reusable container image.

Dockerfile

https://github.com/soumilshah1995/xtable-sync-lambda/blob/main/Dockerfile

requirements.txt

Python Lambda Code

The Lambda function is written in Python and uses the JPype library to interact with Apache XTable's Java classes.

https://github.com/soumilshah1995/xtable-sync-lambda/blob/main/lambda_function.py
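
For readers who have not used JPype, the rough shape of the call is sketched below; see the linked file for the actual implementation. This sketch assumes the bundled XTable utilities jar exposes the main class org.apache.xtable.utilities.RunSync, that it accepts the --datasetConfig flag used by the XTable command line, and that config_path points to a local copy of the config (for example under /tmp). The function names and the jar_path parameter are hypothetical.

```python
def build_sync_args(config_path):
    """Arguments for the XTable sync entry point (assumed CLI flag)."""
    return ["--datasetConfig", config_path]


def run_xtable_sync(config_path, jar_path):
    """Start a JVM in-process (once) and call the XTable sync main class."""
    import jpype  # third-party; imported lazily so the module loads without it

    # A warm Lambda container reuses the process, so only start the JVM once.
    if not jpype.isJVMStarted():
        jpype.startJVM(classpath=[jar_path])

    # Assumed class name of the XTable command-line entry point.
    run_sync = jpype.JClass("org.apache.xtable.utilities.RunSync")

    # Convert the Python list into a Java String[] for main(String[]).
    java_args = jpype.JArray(jpype.JString)(build_sync_args(config_path))
    run_sync.main(java_args)
```

Calling the Java entry point in-process avoids spawning a separate java subprocess on every invocation, which is one reason JPype is a natural fit inside a Lambda container.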

Testing the Setup

Step 1: Build the Docker Image

Step 2: Run the Docker Container

Step 3: Trigger the Lambda Function Locally
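
As a rough sketch of this step, the snippet below posts a test event to the container from Python. It assumes the image is based on an AWS Lambda base image, which ships the Runtime Interface Emulator, and that the container from Step 2 was started with host port 9000 mapped to container port 8080. The event contents are whatever your handler expects.

```python
import json
import urllib.request

# Path served by the AWS Lambda Runtime Interface Emulator in the container.
INVOKE_PATH = "/2015-03-31/functions/function/invocations"


def build_local_invocation(event, host="localhost", port=9000):
    """Build the POST request that invokes the locally running function."""
    url = f"http://{host}:{port}{INVOKE_PATH}"
    data = json.dumps(event).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def invoke_local(event, timeout=900):
    """Send the event to the local container and return the decoded JSON reply."""
    request = build_local_invocation(event)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The generous timeout mirrors Lambda's 15-minute limit, since a sync of a large table can take a while to return.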

Output Screenshots

Why Choose This Architecture?

Flexibility: Sync tables automatically on a schedule or manually as needed.
Scalability: Built on AWS Lambda, the architecture adjusts seamlessly to workloads.
Ease of Use: Centralized configuration management with config.yaml in S3.
Future-Proof: Supports multiple table formats (Hudi, Delta, Iceberg), making it adaptable to evolving data needs.

Labs: https://github.com/soumilshah1995/xtable-sync-lambda

Conclusion

This architecture demonstrates how to combine the power of Apache XTable, AWS Lambda, and API Gateway for a robust table-syncing solution. Whether you need automated CRON jobs or manual sync triggers, this setup is a reliable and scalable choice.

Happy syncing!

#AWS #ApacheXTable #Serverless #Lambda #DataSync #CloudArchitecture

Lalit Moharana

AI Enthusiast || Data Science || Data Engineer || Product Engineer

3 days ago

Really good use case, but I have one question. Since you are using Lambda, which has a 15-minute maximum runtime, don't you think that will be a bottleneck for bigger tables?

Vinish Reddy Pannala

4 days ago

Great blog, Soumil S. I feel the Lambda function would be a good contribution to the XTable project too; it could be useful for AWS users to get started. Your thoughts? We can discuss more on how we package it, etc.

Sagar Lakshmipathy

Solutions Engineering @ Onehouse | We're Hiring!

4 days ago

I like the usage of JPype! Nice blog, Soumil S.


