
Sync Existing Apache Iceberg Tables with AWS Glue Data Catalog: Run It Locally, on Airflow, or EMR with a Simple YAML-based Template

Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Spark & AWS Glue | Data Lake (Hudi | Iceberg) Specialist | YouTuber

Published: January 25, 2025

If you have existing Iceberg tables and need to sync them with the AWS Glue Data Catalog, the iceberg-glue-sync Python package is built for that job. It registers one or many Iceberg tables with the Glue Hive Metastore, making your data discoverable and queryable through AWS services.

Why Use iceberg-glue-sync?

Effortlessly sync existing Iceberg tables to Glue.
Works locally, on Airflow, or Amazon EMR.
Leverages a simple YAML configuration template to define table locations and details.

Video guides

Steps to Sync Your Tables

Create a YAML Configuration File: If you have existing tables, use the following template to define them along with AWS configurations:
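The package's own template is not reproduced in this text, so the following is only a rough illustration of what such a configuration might carry and how it could be validated before a sync. Every field name here (`aws.region`, `tables`, `database`, `name`, `metadata_location`) is hypothetical and is not taken from iceberg-glue-sync; check the repo for the real schema. The dict stands in for what a YAML loader would return.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TableSpec:
    """One existing Iceberg table to register (hypothetical field names)."""
    database: str
    name: str
    metadata_location: str  # path to the table's current metadata JSON file


@dataclass(frozen=True)
class SyncConfig:
    region: str
    tables: list[TableSpec]


def parse_config(raw: dict[str, Any]) -> SyncConfig:
    """Validate a loaded config dict and turn it into a SyncConfig."""
    region = raw.get("aws", {}).get("region")
    if not region:
        raise ValueError("aws.region is required")

    tables = []
    for i, entry in enumerate(raw.get("tables", [])):
        required = ("database", "name", "metadata_location")
        missing = [key for key in required if not entry.get(key)]
        if missing:
            raise ValueError(f"tables[{i}] is missing: {', '.join(missing)}")
        tables.append(
            TableSpec(entry["database"], entry["name"], entry["metadata_location"])
        )

    if not tables:
        raise ValueError("at least one table is required")
    return SyncConfig(region=region, tables=tables)


# Stand-in for a loaded YAML file; all values are placeholders.
EXAMPLE = {
    "aws": {"region": "us-east-1"},
    "tables": [
        {
            "database": "default",
            "name": "orders",
            "metadata_location": "s3://example-bucket/warehouse/orders/metadata/00001-abc.metadata.json",
        }
    ],
}

config = parse_config(EXAMPLE)
print(config.tables[0].name)
```

Failing fast on a missing field keeps a typo in one table entry from surfacing later as a confusing catalog error.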

Run the Sync Command: Execute the sync process by providing the YAML configuration file:
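The exact command is not reproduced in this text. As a sketch of what a sync pass does conceptually, the loop below registers each table by pointing a catalog at its existing metadata file and records the outcome per table. This is an assumption about the general technique, not a description of how iceberg-glue-sync is implemented; `sync_tables`, `CatalogLike`, and `InMemoryCatalog` are hypothetical names, and the S3 paths are placeholders.

```python
from typing import Iterable, Protocol


class CatalogLike(Protocol):
    """Anything exposing register_table(identifier, metadata_location)."""

    def register_table(self, identifier: str, metadata_location: str): ...


def sync_tables(
    catalog: CatalogLike, tables: Iterable[tuple[str, str, str]]
) -> dict[str, str]:
    """Register each (database, table, metadata_location) and report the result.

    A failure on one table is recorded and does not stop the others.
    """
    results: dict[str, str] = {}
    for database, table, metadata_location in tables:
        identifier = f"{database}.{table}"
        try:
            catalog.register_table(identifier, metadata_location)
            results[identifier] = "registered"
        except Exception as exc:  # e.g. table already exists, missing permissions
            results[identifier] = f"failed: {exc}"
    return results


class InMemoryCatalog:
    """Stand-in for a real catalog so the sketch runs without AWS access."""

    def __init__(self) -> None:
        self.tables: dict[str, str] = {}

    def register_table(self, identifier: str, metadata_location: str) -> None:
        if identifier in self.tables:
            raise ValueError(f"{identifier} already exists")
        self.tables[identifier] = metadata_location


catalog = InMemoryCatalog()
report = sync_tables(
    catalog,
    [
        ("default", "orders", "s3://example-bucket/warehouse/orders/metadata/00001-abc.metadata.json"),
        ("default", "customers", "s3://example-bucket/warehouse/customers/metadata/00001-def.metadata.json"),
    ],
)
print(report)
```

To my knowledge, a Glue-backed catalog loaded through PyIceberg exposes a `register_table(identifier, metadata_location)` method of this shape, so it could likely be passed in place of `InMemoryCatalog`; verify against the PyIceberg documentation for your version before relying on it.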

Output

Repo

https://github.com/soumilshah1995/iceberg-glue-sync

Key Use Cases

Sync Existing Tables: Already have Iceberg tables? Use the YAML template to register them effortlessly with Glue.
Flexibility: Run the tool locally, integrate it into Airflow workflows, or use it on Amazon EMR.

With iceberg-glue-sync, keeping your existing Iceberg tables synced with AWS Glue is hassle-free. Simplify your workflows and make your data ready for AWS analytics today!

Note:

I will be adding more sync functionality to support multiple catalogs in the future. Feel free to fork the repository and contribute!

While you can use AWS Glue crawlers for this process, my template offers the flexibility to add functionality and customize it based on your specific use cases and needs.

#AWS #ApacheIceberg #Glue #DataSync #DataEngineering #CloudComputing


