Multiple Spark Writers with Apache Hudi

Apache Hudi, an open-source data lakehouse framework, has emerged as a crucial tool for managing large-scale datasets efficiently. One of its standout features is Multi Writers support, which enables multiple writers to modify a table concurrently while preserving data consistency and integrity. In this blog, we delve into the significance of Multi Writers, the role locks play in this context, and how to enable Multi Writing in Apache Hudi. We'll also survey the available lock providers and walk through a simple lab that illustrates the concept.

Importance of Multi Writers and Scenarios of Use

Multi Writers support in Apache Hudi is essential for scenarios where multiple applications or processes need to write data concurrently to the same dataset. This capability unlocks the potential for real-time data ingestion, batch processing, and analytics pipelines, enabling faster insights and decision-making. Some common scenarios where Multi Writers are invaluable include:

  1. Real-time Data Ingestion: In streaming data applications, multiple sources may simultaneously generate data that needs to be ingested into a centralized storage system. Multi Writers allow seamless ingestion without introducing bottlenecks.
  2. Parallel ETL Processing: In Extract, Transform, Load (ETL) pipelines, different stages of processing might occur concurrently, with each stage updating the dataset. Multi Writers facilitate parallel processing, accelerating data transformation and loading tasks.
  3. Collaborative Data Editing: In collaborative environments where multiple users interact with the same dataset, Multi Writers ensure that concurrent edits or updates do not conflict, preserving data consistency.

The Importance of Locks in Multi Writer Environments

In Multi Writer environments, the potential for conflicts arises when multiple writers attempt to modify the same data simultaneously. To maintain data integrity and consistency, Apache Hudi employs locking mechanisms. Locks prevent concurrent writers from interfering with each other's operations by serializing access to shared resources. Without proper locking, concurrent writes could lead to data corruption or inconsistencies.

List of Lock Providers in Apache Hudi

Apache Hudi offers several lock providers to facilitate Multi Writer support. These lock providers ensure that only one writer can commit changes to the table at a given time, preventing conflicting commits and preserving data integrity. Some of the prominent lock providers include:

  1. FileSystem-based Lock Provider (Experimental): This experimental lock provider uses the underlying filesystem to manage locks. While simple and lightweight, it may lack the scalability and robustness required for production environments.
  2. ZooKeeper-based Lock Provider: ZooKeeper provides a distributed coordination service, making it suitable for managing locks in distributed systems. It offers strong consistency guarantees and is widely used in Apache Hudi deployments.
  3. HiveMetastore-based Lock Provider: Leveraging Apache Hive's metastore, this lock provider offers compatibility with existing Hive deployments. It provides a familiar interface for managing locks and integrates seamlessly with Apache Hudi.
  4. Amazon DynamoDB-based Lock Provider: Designed for deployments on Amazon Web Services (AWS), this lock provider utilizes DynamoDB, a fully managed NoSQL database service. It offers high availability, scalability, and durability, making it suitable for mission-critical applications.

Enabling Multi Writing

To enable Multi Writing in Apache Hudi, developers need to configure the appropriate lock provider based on their deployment environment and requirements. This involves specifying the lock provider in Hudi's configuration settings and ensuring that all writers use the same configuration.
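As a minimal sketch, the options below enable optimistic concurrency control with the (experimental) filesystem-based lock provider. The configuration keys follow Hudi's concurrency-control documentation; the table name and field names are placeholders for illustration:

```python
# Minimal Hudi writer options enabling multi-writer (OCC) mode.
# Table/field names are placeholders; every concurrent writer must
# use the same lock settings.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "state",
    "hoodie.datasource.write.precombine.field": "ts",
    # Multi-writer settings
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider",
}
```

Note that `hoodie.cleaner.policy.failed.writes` is set to `LAZY` because, with multiple writers, failed writes cannot be eagerly rolled back by another active writer.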

Simple Lab to Illustrate Multi Writers

To demonstrate Multi Writers in action, let's consider a scenario with two jobs, U1 and U2, both writing and updating data to different partitions of a dataset. U1 updates data in partition NY, while U2 updates data in partition CA. By running these jobs concurrently, we can observe how Multi Writers enable simultaneous updates without conflicts.


U1.py


U2.py


In a scenario where writer U1 starts writing and then writer U2 starts after a certain time delay (t+1), the following sequence of events occurs:

  1. Writer U1 Starts Writing: U1 begins its write operation to the dataset. Upon initiating the write process, U1 acquires a lock to ensure exclusive access to the dataset during its write operation. This lock prevents other writers, including U2, from modifying the dataset concurrently.
  2. Writer U2 Attempts to Start Writing: After a time delay of t+1, writer U2 begins its write operation to the same dataset. However, since U1 already holds the lock, U2 cannot acquire the lock immediately.
  3. U2 Waits for Lock Release: Finding the lock held by U1, U2 enters a waiting state until U1 releases it.
  4. U1 Completes Write and Releases Lock: Meanwhile, U1 continues its write operation to completion. Once U1 finishes writing and no longer requires exclusive access to the dataset, it releases the lock it acquired earlier.
  5. U2 Acquires Lock and Continues Writing: With the lock now released by U1, writer U2 can finally acquire the lock. Upon successfully acquiring the lock, U2 gains exclusive access to the dataset and can proceed with its write operations without any interference from other writers.
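The sequence above can be simulated with an ordinary in-process lock. This toy sketch (plain Python, no Hudi involved) illustrates only the acquire/wait/release ordering, with delays standing in for the t and t+1 start times:

```python
import threading
import time

lock = threading.Lock()
events = []  # records the order of acquires and releases

def writer(name: str, start_delay: float, work: float) -> None:
    time.sleep(start_delay)
    with lock:                       # blocks until the lock is free
        events.append(f"{name} acquired lock")
        time.sleep(work)             # simulated write
        events.append(f"{name} releasing lock")

u1 = threading.Thread(target=writer, args=("U1", 0.0, 0.5))   # starts at t
u2 = threading.Thread(target=writer, args=("U2", 0.2, 0.1))   # starts at t+1
u1.start(); u2.start()
u1.join(); u2.join()
print(events)
```

Since U1 grabs the lock first, U2's acquire is always ordered after U1's release, mirroring steps 1 through 5.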

In summary, the lock mechanism ensures that only one writer can modify the dataset at any given time, thereby preventing conflicts and ensuring data integrity. Writers such as U2 must wait for the lock to be released by the current writer (U1) before they can proceed with their own write operations. This sequential access to the dataset guarantees consistency and prevents concurrent writes from causing data corruption or inconsistencies.

Final Snapshot of the Table

All writes were successful with locks in place.


GitHub: https://github.com/soumilshah1995/Multiple-Spark-Writers-with-Apache-Hudi/tree/main


Conclusion

Apache Hudi's Multi Writers support unlocks the potential for scalable and concurrent data processing, enabling real-time analytics, collaborative editing, and parallel processing. By understanding the importance of locks and selecting the appropriate lock provider, developers can ensure data integrity and consistency in Multi Writer environments. With the provided lab and references, developers can explore Multi Writers in Apache Hudi and harness its capabilities to build robust and scalable data applications.

I'm currently delving into multi-writer scenarios through small experiments in my own labs. This topic is one I intend to explore further and deepen my understanding of in the near future.

References

https://hudi.apache.org/docs/concurrency_control/

https://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6

