Concurrent Writes Test for New S3 Table Buckets: Can They Handle 10 Spark Writers Performing MERGE INTO on Different Partitions?

Introduction

In modern big data applications, managing concurrent writes to distributed storage such as Amazon S3 is a critical challenge. Large data volumes often require orchestrating multiple concurrent Spark writers, each performing operations like MERGE INTO against different partitions of the data lake. In this blog, we'll walk through how to manage these concurrent writes efficiently using Apache Iceberg with Apache Spark, cover the key Iceberg settings that make concurrent writes succeed, and provide a working example for testing these configurations.



The Challenge: Concurrent Writes to S3

Concurrent writes, especially multiple Spark writers performing MERGE INTO operations on the same partitioned table, raise challenges around consistency, data integrity, and performance. Iceberg's table properties provide robust options for managing all three.

In this example, we simulate 10 Spark writers running in parallel, each writing to a different partition of an Iceberg table stored in an S3 table bucket. Each writer uses the MERGE INTO SQL command to upsert records into its own partition.

Key Configurations in Iceberg

To handle concurrent writes efficiently, we rely on the following Iceberg table properties (a sketch showing how to apply them follows this list):

  • commit.retry.num-retries = 20

This property sets how many times Iceberg retries a commit after a failure. With a value of 20, Iceberg automatically re-attempts the commit up to 20 times when conflicts or transient issues occur, which makes the writes far more resilient.

  • commit.retry.min-wait-ms = 30000

The minimum wait time, in milliseconds, between commit retries. Spacing retries out like this avoids overwhelming the catalog when several writers conflict at the same time.

  • write.merge.isolation-level = snapshot

This property controls the isolation level of MERGE INTO operations. Setting it to snapshot makes each operation run against a consistent snapshot of the table and lets concurrent writers commit as long as they don't modify the same data files, which is exactly our situation, since each writer targets a different partition.
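
These are ordinary Iceberg table properties, so they can be set when the table is created or applied to an existing table. Here is a minimal sketch, assuming an existing SparkSession named spark and a placeholder table name s3tables.demo.orders:

```python
# A minimal sketch: applying the three properties to an existing Iceberg table.
# The catalog, namespace, and table names are placeholders.
spark.sql("""
    ALTER TABLE s3tables.demo.orders SET TBLPROPERTIES (
        'commit.retry.num-retries'    = '20',
        'commit.retry.min-wait-ms'    = '30000',
        'write.merge.isolation-level' = 'snapshot'
    )
""")
```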

Example: Test Case Setup

Let's walk through an example where we simulate concurrent writes using 10 Spark writers, each performing a MERGE INTO operation against a different partition.

Step 1: Create Iceberg table
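
The original post embedded this step as an image, so the snippet below is a reconstruction under stated assumptions: a SparkSession whose s3tables catalog is already configured against the new S3 table bucket, and an illustrative orders schema partitioned by region. All names are placeholders, not the original code.

```python
from pyspark.sql import SparkSession

# Sketch of the table-creation step. It assumes the "s3tables" catalog is
# already configured against the S3 table bucket; the schema and all names
# are illustrative.
spark = SparkSession.builder.appName("create-iceberg-table").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS s3tables.demo.orders (
        order_id   STRING,
        region     STRING,
        amount     DOUBLE,
        updated_at TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (region)
    TBLPROPERTIES (
        'commit.retry.num-retries'    = '20',
        'commit.retry.min-wait-ms'    = '30000',
        'write.merge.isolation-level' = 'snapshot'
    )
""")
```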


Spark Writer Job

PySpark Code
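
The writer code was also embedded as an image; here is a hedged sketch of what each job might look like. Each of the 10 jobs receives a distinct region as a command-line argument, generates a small batch of illustrative rows, and merges them into only its own partition of the hypothetical orders table above.

```python
import sys

from pyspark.sql import SparkSession, functions as F

# Sketch of one writer job: each of the 10 jobs is launched with a distinct
# region, so its MERGE INTO touches only its own partition.
region = sys.argv[1]

spark = SparkSession.builder.appName(f"merge-writer-{region}").getOrCreate()

# Build a small batch of upserts for this writer's partition (illustrative data).
updates = (
    spark.createDataFrame(
        [(f"{region}-{i}", region, float(i)) for i in range(100)],
        ["order_id", "region", "amount"],
    )
    .withColumn("updated_at", F.current_timestamp())
)
updates.createOrReplaceTempView("updates")

# The region predicate keeps the merge scoped to a single partition.
spark.sql(f"""
    MERGE INTO s3tables.demo.orders AS t
    USING updates AS s
    ON t.order_id = s.order_id AND t.region = '{region}'
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```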

Let's Submit 10 Spark Jobs
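
One simple way to launch the 10 writers concurrently is a small Python driver that fans out spark-submit calls; merge_writer.py is the hypothetical job script from the previous step, and the region list is illustrative.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Launch 10 concurrent writers, one per partition value.
# "merge_writer.py" is the hypothetical job script from the previous step.
regions = [f"region-{i}" for i in range(10)]

def submit(region: str) -> int:
    # Each spark-submit runs one writer against a single partition.
    return subprocess.run(["spark-submit", "merge_writer.py", region]).returncode

with ThreadPoolExecutor(max_workers=10) as pool:
    exit_codes = list(pool.map(submit, regions))

print("exit codes:", exit_codes)
```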

Final Output
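
The run's output appeared as an image in the original post. As a stand-in, here is one way to inspect the final state: counting rows per partition and listing the commits recorded in Iceberg's snapshots metadata table (the table name is the same placeholder as above).

```python
# Count rows per partition to confirm every writer's data landed.
spark.sql("""
    SELECT region, COUNT(*) AS row_count
    FROM s3tables.demo.orders
    GROUP BY region
""").show()

# List commits; with retries enabled, a successful snapshot from each
# writer should appear here.
spark.sql("""
    SELECT committed_at, operation
    FROM s3tables.demo.orders.snapshots
""").show(truncate=False)
```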

Conclusion

All MERGE INTO commands executed successfully against the Iceberg table. Although some commit conflicts occurred along the way, the retry settings ensured that every operation eventually succeeded. The commit.retry.num-retries, commit.retry.min-wait-ms, and write.merge.isolation-level properties played a key role in guaranteeing reliable, consistent writes: they allowed failed commits to be retried until all data made it into the new S3 table bucket without losing any records.

