In this post, we test whether Apache Spark can join two managed Iceberg tables that live in different S3 Table buckets, and walk through how to configure the Spark session to make it work.
The Goal of This Test
The goal of this test is to verify that Apache Spark can join Iceberg tables stored in separate S3 Table buckets. By placing the customers table in demo-bucket1 and the orders table in demo-bucket2, we aim to confirm that cross-bucket joins work seamlessly and that data can be processed efficiently in AWS environments.
Step 1: Create Table Buckets
First, we'll create two S3 table buckets where our Iceberg tables will reside. Here's the command to create two separate buckets:
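The original command isn't reproduced here, so below is a minimal boto3 sketch of the same step. It assumes a recent boto3 version that includes the s3tables client, and us-east-1 is a placeholder region; the equivalent AWS CLI call is aws s3tables create-table-bucket --name <bucket-name>.

```python
import boto3

# Sketch only: create the two S3 Table buckets via the S3 Tables API.
# Region and bucket names are placeholders; adjust for your account.
s3tables = boto3.client("s3tables", region_name="us-east-1")

for name in ("demo-bucket1", "demo-bucket2"):
    response = s3tables.create_table_bucket(name=name)
    print(name, "->", response["arn"])  # note the ARNs; Spark needs them later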
These two buckets (demo-bucket1 and demo-bucket2) will house the customers and orders Iceberg tables, respectively.
Step 2: Create Customer Table in demo-bucket1
Next, we’ll configure the Spark session and create the customers table in demo-bucket1.
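A minimal sketch of what this configuration can look like. The catalog name (bucket1), namespace (demo), table schema, account ID, and region are all illustrative placeholders, and the session assumes the Iceberg Spark runtime and the s3-tables-catalog-for-iceberg-runtime package are on the classpath (for example via --packages):

```python
from pyspark.sql import SparkSession

# Sketch only: one Iceberg catalog backed by demo-bucket1's table bucket ARN.
# Account ID, region, and names are placeholders.
spark = (
    SparkSession.builder.appName("s3tables-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.bucket1", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.bucket1.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.bucket1.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/demo-bucket1")
    .getOrCreate()
)

# Create a namespace and the customers table inside demo-bucket1
spark.sql("CREATE NAMESPACE IF NOT EXISTS bucket1.demo")
spark.sql("""
    CREATE TABLE IF NOT EXISTS bucket1.demo.customers (
        customer_id INT,
        name        STRING
    ) USING iceberg
""")
spark.sql("INSERT INTO bucket1.demo.customers VALUES (1, 'Alice'), (2, 'Bob')")
```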
Step 3: Create Orders Table in demo-bucket2
Similarly, we create the orders table in demo-bucket2:
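A sketch of the same pattern pointed at the second bucket. It assumes a bucket2 catalog configured exactly like Step 2 but with demo-bucket2's table bucket ARN as the warehouse; the schema and sample rows are placeholders:

```python
# Sketch only: assumes a "bucket2" catalog configured as in Step 2,
# with demo-bucket2's table bucket ARN as the warehouse.
spark.sql("CREATE NAMESPACE IF NOT EXISTS bucket2.demo")
spark.sql("""
    CREATE TABLE IF NOT EXISTS bucket2.demo.orders (
        order_id    INT,
        customer_id INT,
        amount      DOUBLE
    ) USING iceberg
""")
spark.sql("""
    INSERT INTO bucket2.demo.orders VALUES
        (101, 1, 25.00),
        (102, 2, 40.50)
""")
```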
Output
[Screenshots: table listings for Bucket 1 (demo-bucket1, customers) and Bucket 2 (demo-bucket2, orders)]
Step 4: Attempt to Join Tables from Different S3 Buckets
To join the customers table from demo-bucket1 and the orders table from demo-bucket2, we’ll configure the Spark session with both S3 table bucket ARNs.
Create Spark Session
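This is the key step. A minimal sketch: one session carrying two Iceberg catalogs, each with its own table bucket ARN as the warehouse. Catalog names, account ID, and region are placeholders:

```python
from pyspark.sql import SparkSession

# Sketch only: one session, two Iceberg catalogs, one per table bucket ARN.
spark = (
    SparkSession.builder.appName("s3tables-cross-bucket-join")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Catalog backed by demo-bucket1
    .config("spark.sql.catalog.bucket1", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.bucket1.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.bucket1.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/demo-bucket1")
    # Catalog backed by demo-bucket2
    .config("spark.sql.catalog.bucket2", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.bucket2.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.bucket2.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/demo-bucket2")
    .getOrCreate()
)
```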
Let's Try the JOIN
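A sketch of the join itself, assuming the catalogs, namespace, and schemas from the session above:

```python
# Sketch only: fully qualified catalog.namespace.table names route each
# side of the join to the right bucket.
result = spark.sql("""
    SELECT c.customer_id, c.name, o.order_id, o.amount
    FROM bucket1.demo.customers AS c
    JOIN bucket2.demo.orders    AS o
      ON c.customer_id = o.customer_id
""")
result.show()
```

Because each catalog carries its own warehouse ARN, Spark resolves bucket1.* and bucket2.* to different table buckets, which is what makes the cross-bucket join possible.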
Output
[Screenshot: the join completes successfully and returns the matched customer and order rows]
Conclusion
In this post, we successfully configured Apache Spark to use multiple S3 table bucket ARNs and performed a join between two tables stored in different buckets (demo-bucket1 and demo-bucket2). This test validated that Spark can seamlessly join Iceberg tables across different S3 Table buckets.
Comment (Product @ Qlik | Data engineer | Advocate for better data, 1 month ago):
Nice write-up, but it's kind of messed up that you need to jump through all these hoops just to join two tables in the same lake and the same catalog. Did you try federating the S3 Tables with the Glue catalog and then joining them using the Glue catalog only?