Sync Tables in All Three Formats (Hudi | Delta | Iceberg) with XTable and AWS Lambda: Automate, Schedule, or Trigger On-Demand

Keep tables in sync across multiple formats (Hudi, Delta, Iceberg) with a serverless AWS architecture. The solution uses Apache XTable, AWS Lambda, and API Gateway to give you control over how and when your tables are synced. Let’s walk through the architecture and see how it works.


Overview of the Architecture

This setup allows syncing tables in three formats—Hudi, Delta, and Iceberg. It supports:

  1. Scheduled Syncs using CRON jobs.
  2. Manual Syncs triggered via an API Gateway.
  3. Process-Driven Triggers, where another job or service calls the API whenever it needs a sync.

How It Works

CRON Configuration:

  • A CRON job is set up to point to a config.yaml file stored in an S3 bucket.
  • The CRON job triggers an AWS Lambda function, which reads the configuration and executes the Apache XTable sync command.
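To make the config.yaml mentioned above concrete, here is a minimal sketch that renders a dataset config in Python. The field names (sourceFormat, targetFormats, datasets, tableBasePath, tableName, partitionSpec) follow the layout I understand XTable's sync utility to read, so check them against the XTable version you use. The bucket and table names are hypothetical.

```python
def render_xtable_config(source_format, target_formats, datasets):
    """Render a dataset config as YAML text (field names assumed from XTable docs)."""
    lines = [f"sourceFormat: {source_format}", "targetFormats:"]
    lines += [f"  - {fmt}" for fmt in target_formats]
    lines.append("datasets:")
    for ds in datasets:
        lines.append(f"  - tableBasePath: {ds['tableBasePath']}")
        lines.append(f"    tableName: {ds['tableName']}")
        # partitionSpec is optional, so emit it only when it is provided.
        if ds.get("partitionSpec"):
            lines.append(f"    partitionSpec: {ds['partitionSpec']}")
    return "\n".join(lines) + "\n"


# Hypothetical bucket and table names, for illustration only.
config_text = render_xtable_config(
    source_format="HUDI",
    target_formats=["DELTA", "ICEBERG"],
    datasets=[{
        "tableBasePath": "s3://my-bucket/warehouse/orders",
        "tableName": "orders",
    }],
)
print(config_text)
```

Uploading the rendered text to S3 as config.yaml gives the scheduled Lambda a single file to read on each run.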

Manual Sync:

  • Users or processes can initiate a manual sync by making a POST request to the API Gateway.
  • The API Gateway sends the request to a Lambda function, which runs the sync command with the specified configuration.
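Both trigger paths can share one handler. The sketch below shows one way to do that, assuming an API Gateway proxy integration (which delivers the POST body as a JSON string) and a scheduled event that carries no body. The config_s3_uri field and the CONFIG_S3_URI environment variable are hypothetical names chosen for illustration, not taken from the author's repository. The S3 client is passed in so the logic can be exercised without AWS credentials.

```python
import json
import os
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid S3 object URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def resolve_config_uri(event, default_uri=None):
    """Pick the config location from an API Gateway or scheduled event."""
    body = event.get("body")
    if body:
        # API Gateway proxy integrations deliver the POST body as a JSON string.
        payload = json.loads(body) if isinstance(body, str) else body
        if payload.get("config_s3_uri"):  # hypothetical field name
            return payload["config_s3_uri"]
    # Scheduled invocations carry no body, so fall back to a default,
    # here read from a hypothetical CONFIG_S3_URI environment variable.
    uri = default_uri or os.environ.get("CONFIG_S3_URI")
    if not uri:
        raise ValueError("no config location in the event or environment")
    return uri


def fetch_config(s3_client, uri, dest_dir="/tmp"):
    """Download the config to local disk; /tmp is Lambda's writable directory."""
    bucket, key = parse_s3_uri(uri)
    local_path = os.path.join(dest_dir, os.path.basename(key))
    s3_client.download_file(bucket, key, local_path)
    return local_path
```

Inside a real handler, the call would look like fetch_config(boto3.client("s3"), resolve_config_uri(event)), and the returned path would be handed to the XTable sync step.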

Serverless Scalability:

  • AWS Lambda scales out automatically as sync requests arrive, so there are no servers to provision or manage. Each invocation is subject to Lambda's 15-minute maximum runtime.

Technical Details

Dockerized Lambda Function

We leverage Docker to bundle all necessary dependencies, Java libraries, and Python code into a single, reusable container image.

Dockerfile

https://github.com/soumilshah1995/xtable-sync-lambda/blob/main/Dockerfile

requirements.txt


Python Lambda Code

The Lambda function is written in Python and uses the JPype library to interact with Apache XTable's Java classes.

https://github.com/soumilshah1995/xtable-sync-lambda/blob/main/lambda_function.py
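For readers who have not used JPype, the following is a rough sketch of the pattern, not the author's code (which lives at the link above). It assumes the XTable bundled utilities jar is present in the image and that its entry point is org.apache.xtable.utilities.RunSync, accepting a --datasetConfig argument. Both are assumptions to verify against your XTable build. The jar path is supplied by the caller.

```python
def build_sync_args(config_path):
    """Command-line arguments for XTable's sync utility (flag name assumed)."""
    return ["--datasetConfig", config_path]


def run_sync(jar_path, config_path):
    """Start a JVM inside the Python process and call the utility's main method."""
    import jpype  # imported lazily so this module loads without JPype installed

    if not jpype.isJVMStarted():
        # A JVM can be started only once per process, so a warm Lambda
        # container reuses it across invocations.
        jpype.startJVM(classpath=[jar_path])

    # Assumed entry point: the main class of XTable's bundled utilities jar.
    run_sync_cls = jpype.JClass("org.apache.xtable.utilities.RunSync")
    # Convert the Python list into a Java String[] for main(String[] args).
    java_args = jpype.JArray(jpype.JString)(build_sync_args(config_path))
    run_sync_cls.main(java_args)
```

Calling the Java class in-process avoids spawning a separate java subprocess per invocation, which is one reason JPype is a convenient fit for Lambda.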

Testing the Setup

Step 1: Build the Docker Image

Step 2: Run the Docker Container



Step 3: Trigger the Lambda Function Locally
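One way to trigger the function locally is to POST an event to the container's invocation endpoint. The sketch below assumes the image is based on an AWS Lambda base image (which ships the Runtime Interface Emulator) and that the container's port 8080 is published on local port 9000. The event payload is a hypothetical example, not the exact shape the author's handler expects.

```python
import json
import urllib.request

# Endpoint exposed by the Lambda Runtime Interface Emulator when the
# container's port 8080 is published on local port 9000 (assumed mapping).
LOCAL_INVOKE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_invoke_request(event, url=LOCAL_INVOKE_URL):
    """Build the POST request that delivers `event` to the local function."""
    data = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke_local(event, timeout=900):
    """Send the event and return the handler's JSON-decoded response."""
    with urllib.request.urlopen(build_invoke_request(event), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical event mimicking an API Gateway POST body.
sample_event = {"body": json.dumps({"config_s3_uri": "s3://my-bucket/configs/config.yaml"})}
request = build_invoke_request(sample_event)
```

With the container from Step 2 running, invoke_local(sample_event) sends the request and returns whatever the handler returns.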




Why Choose This Architecture?

  1. Flexibility: Sync tables automatically on a schedule or manually as needed.
  2. Scalability: Built on AWS Lambda, the architecture scales with the number of sync requests without any capacity planning.
  3. Ease of Use: Centralized configuration management with config.yaml in S3.
  4. Future-Proof: Supports multiple table formats (Hudi, Delta, Iceberg), making it adaptable to evolving data needs.

Labs: https://github.com/soumilshah1995/xtable-sync-lambda

Conclusion

This architecture demonstrates how to combine the power of Apache XTable, AWS Lambda, and API Gateway for a robust table-syncing solution. Whether you need automated CRON jobs or manual sync triggers, this setup is a reliable and scalable choice.

Happy syncing!

#AWS #ApacheXTable #Serverless #Lambda #DataSync #CloudArchitecture

Lalit Moharana

AI Enthusiast || Data Science || Data Engineer || Product Engineer

3 days ago

Really good use case, but I have one question. Since you are using Lambda, which has a 15-minute max runtime, don't you think that will be a bottleneck for bigger table sizes?

Great blog Soumil S. I feel the lambda function would be a good contribution in the XTable project too, can be useful for AWS users to get started. Your thoughts ? We can discuss more on how we package it etc.

Sagar Lakshmipathy

Solutions Engineering @ Onehouse | We're Hiring!

4 days ago

i like the usage of jpype! nice blog Soumil S.