In this post, we test whether Apache Spark can join two managed Iceberg tables that live in different S3 Table buckets, and walk through how to configure the Spark session to make it work.
The Goal of This Test
The goal of this test is to verify that Apache Spark can join Iceberg tables stored in separate S3 Table buckets. By placing the customers table in demo-bucket1 and the orders table in demo-bucket2, we aim to confirm that cross-bucket joins work seamlessly and that data can be processed efficiently in AWS environments.
Step 1: Create Table Buckets
First, we'll create two S3 table buckets where our Iceberg tables will reside. Here's the command to create two separate buckets:
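The original command isn't reproduced here, so below is a minimal boto3 sketch of the same step. It assumes a recent boto3 version that includes the s3tables client, and us-east-1 is a placeholder region; the equivalent AWS CLI call is aws s3tables create-table-bucket --name <bucket-name>.

```python
import boto3

# Sketch only: create the two S3 Table buckets via the S3 Tables API.
# Region and bucket names are placeholders; adjust for your account.
s3tables = boto3.client("s3tables", region_name="us-east-1")

for name in ("demo-bucket1", "demo-bucket2"):
    response = s3tables.create_table_bucket(name=name)
    print(name, "->", response["arn"])  # note the ARNs; Spark needs them later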
These two buckets (demo-bucket1 and demo-bucket2) will house the customers and orders Iceberg tables, respectively.
Step 2: Create Customer Table in demo-bucket1
Next, we’ll configure the Spark session and create the customers table in demo-bucket1.
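A minimal sketch of what this configuration can look like. The catalog name (bucket1), namespace (demo), table schema, account ID, and region are all illustrative placeholders, and the session assumes the Iceberg Spark runtime and the s3-tables-catalog-for-iceberg-runtime package are on the classpath (for example via --packages):

```python
from pyspark.sql import SparkSession

# Sketch only: one Iceberg catalog backed by demo-bucket1's table bucket ARN.
# Account ID, region, and names are placeholders.
spark = (
    SparkSession.builder.appName("s3tables-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.bucket1", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.bucket1.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.bucket1.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/demo-bucket1")
    .getOrCreate()
)

# Create a namespace and the customers table inside demo-bucket1
spark.sql("CREATE NAMESPACE IF NOT EXISTS bucket1.demo")
spark.sql("""
    CREATE TABLE IF NOT EXISTS bucket1.demo.customers (
        customer_id INT,
        name        STRING
    ) USING iceberg
""")
spark.sql("INSERT INTO bucket1.demo.customers VALUES (1, 'Alice'), (2, 'Bob')")
```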
Step 3: Create Orders Table in demo-bucket2
Similarly, we create the orders table in demo-bucket2:
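A sketch of the same pattern pointed at the second bucket. It assumes a bucket2 catalog configured exactly like Step 2 but with demo-bucket2's table bucket ARN as the warehouse; the schema and sample rows are placeholders:

```python
# Sketch only: assumes a "bucket2" catalog configured as in Step 2,
# with demo-bucket2's table bucket ARN as the warehouse.
spark.sql("CREATE NAMESPACE IF NOT EXISTS bucket2.demo")
spark.sql("""
    CREATE TABLE IF NOT EXISTS bucket2.demo.orders (
        order_id    INT,
        customer_id INT,
        amount      DOUBLE
    ) USING iceberg
""")
spark.sql("""
    INSERT INTO bucket2.demo.orders VALUES
        (101, 1, 25.00),
        (102, 2, 40.50)
""")
```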
Output
[Screenshots: table listings for Bucket 1 (demo-bucket1, customers) and Bucket 2 (demo-bucket2, orders)]
Step 4: Attempt to Join Tables from Different S3 Buckets
To join the customers table from demo-bucket1 and the orders table from demo-bucket2, we’ll configure the Spark session with both S3 table bucket ARNs.
Create Spark Session
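This is the key step. A minimal sketch: one session carrying two Iceberg catalogs, each with its own table bucket ARN as the warehouse. Catalog names, account ID, and region are placeholders:

```python
from pyspark.sql import SparkSession

# Sketch only: one session, two Iceberg catalogs, one per table bucket ARN.
spark = (
    SparkSession.builder.appName("s3tables-cross-bucket-join")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Catalog backed by demo-bucket1
    .config("spark.sql.catalog.bucket1", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.bucket1.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.bucket1.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/demo-bucket1")
    # Catalog backed by demo-bucket2
    .config("spark.sql.catalog.bucket2", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.bucket2.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.bucket2.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/demo-bucket2")
    .getOrCreate()
)
```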
Let's Try the JOIN
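A sketch of the join itself, assuming the catalogs, namespace, and schemas from the session above:

```python
# Sketch only: fully qualified catalog.namespace.table names route each
# side of the join to the right bucket.
result = spark.sql("""
    SELECT c.customer_id, c.name, o.order_id, o.amount
    FROM bucket1.demo.customers AS c
    JOIN bucket2.demo.orders    AS o
      ON c.customer_id = o.customer_id
""")
result.show()
```

Because each catalog carries its own warehouse ARN, Spark resolves bucket1.* and bucket2.* to different table buckets, which is what makes the cross-bucket join possible.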
Output
[Screenshot: the join completes successfully and returns the matched customer and order rows]
Conclusion
In this post, we successfully configured Apache Spark to use multiple S3 table bucket ARNs and performed a join between two tables stored in different buckets (demo-bucket1 and demo-bucket2). This test validated that Spark can seamlessly join Iceberg tables across different S3 Table buckets.
Comment (Product @ Qlik | Data engineer | Advocate for better data, 1 month ago):
Nice write-up, but it's kind of messed up that you need to jump through all these hoops just to join two tables in the same lake and the same catalog. Did you try federating the S3 Tables with the Glue catalog and then joining them using the Glue catalog only?