Multiple Spark Writers with Apache Hudi

Apache Hudi, an open-source data lakehouse framework, has emerged as a crucial tool for managing large-scale datasets efficiently. One of its standout features is Multi Writers support, which enables multiple writers to modify a table concurrently while preserving data consistency and integrity. In this blog, we delve into the significance of Multi Writers, the role locks play in this context, and how to enable Multi Writing in Apache Hudi. We'll also survey the available lock providers and walk through a simple lab that illustrates the concept.

Importance of Multi Writers and Scenarios of Use

Multi Writers support in Apache Hudi is essential for scenarios where multiple applications or processes need to write data concurrently to the same dataset. This capability unlocks the potential for real-time data ingestion, batch processing, and analytics pipelines, enabling faster insights and decision-making. Some common scenarios where Multi Writers are invaluable include:

  1. Real-time Data Ingestion: In streaming data applications, multiple sources may simultaneously generate data that needs to be ingested into a centralized storage system. Multi Writers allow seamless ingestion without introducing bottlenecks.
  2. Parallel ETL Processing: In Extract, Transform, Load (ETL) pipelines, different stages of processing might occur concurrently, with each stage updating the dataset. Multi Writers facilitate parallel processing, accelerating data transformation and loading tasks.
  3. Collaborative Data Editing: In collaborative environments where multiple users interact with the same dataset, Multi Writers ensure that concurrent edits or updates do not conflict, preserving data consistency.

The Importance of Locks in Multi Writer Environments

In Multi Writer environments, the potential for conflicts arises when multiple writers attempt to modify the same data simultaneously. To maintain data integrity and consistency, Apache Hudi employs locking mechanisms. Locks prevent concurrent writers from interfering with each other's operations by serializing access to shared resources. Without proper locking, concurrent writes could lead to data corruption or inconsistencies.

List of Lock Providers in Apache Hudi

Apache Hudi offers several lock providers to facilitate Multi Writer support. These lock providers ensure that only one writer can commit changes to the table at a given time, preventing conflicting commits and preserving data integrity. Some of the prominent lock providers include:

  1. FileSystem-based Lock Provider (Experimental): This experimental lock provider uses the underlying filesystem to manage locks. While simple and lightweight, it may lack the scalability and robustness required for production environments.
  2. ZooKeeper-based Lock Provider: ZooKeeper provides a distributed coordination service, making it suitable for managing locks in distributed systems. It offers strong consistency guarantees and is widely used in Apache Hudi deployments.
  3. HiveMetastore-based Lock Provider: Leveraging Apache Hive's metastore, this lock provider offers compatibility with existing Hive deployments. It provides a familiar interface for managing locks and integrates seamlessly with Apache Hudi.
  4. Amazon DynamoDB-based Lock Provider: Designed for deployments on Amazon Web Services (AWS), this lock provider utilizes DynamoDB, a fully managed NoSQL database service. It offers high availability, scalability, and durability, making it suitable for mission-critical applications.

Enabling Multi Writing

To enable Multi Writing in Apache Hudi, developers need to configure the appropriate lock provider based on their deployment environment and requirements. This involves specifying the lock provider in Hudi's configuration settings and ensuring that all writers use the same configuration.
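As a minimal sketch, the options below enable optimistic concurrency control with the (experimental) filesystem-based lock provider. The configuration keys follow Hudi's concurrency-control documentation; the table name and field names are placeholders for illustration:

```python
# Minimal Hudi writer options enabling multi-writer (OCC) mode.
# Table/field names are placeholders; every concurrent writer must
# use the same lock settings.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "state",
    "hoodie.datasource.write.precombine.field": "ts",
    # Multi-writer settings
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider",
}
```

Note that `hoodie.cleaner.policy.failed.writes` is set to `LAZY` because, with multiple writers, failed writes cannot be eagerly rolled back by another active writer.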

Simple Lab to Illustrate Multi Writers

To demonstrate Multi Writers in action, let's consider a scenario with two jobs, U1 and U2, both writing and updating data to different partitions of a dataset. U1 updates data in partition NY, while U2 updates data in partition CA. By running these jobs concurrently, we can observe how Multi Writers enable simultaneous updates without conflicts.


U1.py


U2.py


In a scenario where writer U1 starts writing and then writer U2 starts after a certain time delay (t+1), the following sequence of events occurs:

  1. Writer U1 Starts Writing: U1 begins its write operation to the dataset. Upon initiating the write process, U1 acquires a lock to ensure exclusive access to the dataset during its write operation. This lock prevents other writers, including U2, from modifying the dataset concurrently.
  2. Writer U2 Attempts to Start Writing: After a time delay of t+1, writer U2 begins its write operation to the same dataset. However, since U1 already holds the lock, U2 cannot acquire the lock immediately.
  3. U2 Waits for Lock Release: Finding the lock held by U1, U2 enters a waiting state until U1 releases it.
  4. U1 Completes Write and Releases Lock: Meanwhile, U1 continues its write operation to completion. Once U1 finishes writing and no longer requires exclusive access to the dataset, it releases the lock it acquired earlier.
  5. U2 Acquires Lock and Continues Writing: With the lock now released by U1, writer U2 can finally acquire the lock. Upon successfully acquiring the lock, U2 gains exclusive access to the dataset and can proceed with its write operations without any interference from other writers.
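The sequence above can be simulated with an ordinary in-process lock. This toy sketch (plain Python, no Hudi involved) illustrates only the acquire/wait/release ordering, with delays standing in for the t and t+1 start times:

```python
import threading
import time

lock = threading.Lock()
events = []  # records the order of acquires and releases

def writer(name: str, start_delay: float, work: float) -> None:
    time.sleep(start_delay)
    with lock:                       # blocks until the lock is free
        events.append(f"{name} acquired lock")
        time.sleep(work)             # simulated write
        events.append(f"{name} releasing lock")

u1 = threading.Thread(target=writer, args=("U1", 0.0, 0.5))   # starts at t
u2 = threading.Thread(target=writer, args=("U2", 0.2, 0.1))   # starts at t+1
u1.start(); u2.start()
u1.join(); u2.join()
print(events)
```

Since U1 grabs the lock first, U2's acquire is always ordered after U1's release, mirroring steps 1 through 5.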

In summary, the lock mechanism ensures that only one writer can modify the dataset at any given time, thereby preventing conflicts and ensuring data integrity. Writers such as U2 must wait for the lock to be released by the current writer (U1) before they can proceed with their own write operations. This sequential access to the dataset guarantees consistency and prevents concurrent writes from causing data corruption or inconsistencies.

Final Snapshot of the Table

All writes were successful with locks in place.


GitHub: https://github.com/soumilshah1995/Multiple-Spark-Writers-with-Apache-Hudi/tree/main


Conclusion

Apache Hudi's Multi Writers support unlocks the potential for scalable and concurrent data processing, enabling real-time analytics, collaborative editing, and parallel processing. By understanding the importance of locks and selecting the appropriate lock provider, developers can ensure data integrity and consistency in Multi Writer environments. With the provided lab and references, developers can explore Multi Writers in Apache Hudi and harness its capabilities to build robust and scalable data applications.

I'm currently delving into multi-writer scenarios through small experiments in my own labs. This topic is one I intend to explore further and deepen my understanding of in the near future.

References

https://hudi.apache.org/docs/concurrency_control/

https://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6

