Delta Lake, Iceberg & Hudi: A Transactional Perspective
Debmalya Biswas
AI/Analytics @ Wipro | ex-Nokia, SAP, Oracle | 50+ patents | PhD - INRIA
Abstract. Transactions with their ACID guarantees used to be the backbone of Database Management Systems. With the arrival of Streaming and NoSQL, however, transactions were considered too strict and difficult to implement for Big Data platforms. Eventual consistency became the norm for such platforms: distributed nodes may temporarily return different values for the same data, with all nodes converging to the same value at a later point in time.
However, Big Data platforms / frameworks have now matured to the point that we are seeing a resurgence of platforms providing ACID support, e.g., Delta Lake, Hudi, Iceberg. In this article, we make the case for Transactions in a Big Data World, providing the necessary background on how transactions can be implemented in such scenarios. We show a concrete application in terms of how ACID transactions play a key role in enabling Data Historization.
Transactions
A transaction [1] can be considered as a group of operations encapsulated by the operations Begin and Commit/Abort, having the following (ACID) properties:
Atomicity: either all operations of the transaction take effect, or none do.
Consistency: a transaction takes the system from one consistent state to another.
Isolation: the intermediate effects of a transaction are not visible to concurrently executing transactions.
Durability: once a transaction commits, its effects survive subsequent failures.
For example, let us consider the classic bank transaction t which involves transferring money from an account A to another account B. The transaction consists of two operations — the first operation withdraws money from account A and the second deposits it into account B. Needless to say, any partial execution of the transaction would result in an inconsistent state. The atomicity property ensures that either both withdraw and deposit operations succeed or both fail. The isolation property ensures that the changes to both accounts A and B are not visible to other transactions until t commits. The atomicity and isolation property together ensure the consistency of the system (accounts A and B).
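To make the semantics concrete, here is a minimal sketch of the transfer using SQLite's transaction support in Python; the accounts table, ids, and balances are all illustrative:

```python
import sqlite3

# Illustrative schema: an accounts table holding accounts "A" and "B"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100.0), ("B", 50.0)])

try:
    # The connection as a context manager opens a transaction:
    # commit on success, rollback if any statement fails
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 'A'")  # withdraw
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 'B'")  # deposit
except sqlite3.Error:
    # Atomicity: after a rollback neither update is visible,
    # so the combined balance of A and B stays consistent
    pass
```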
Transactions in a Big Data context
Delta Lake is an open-source framework that brings ACID transactions to Apache Spark and big data workloads. One can download open-source Delta Lake and use it on-prem with HDFS. The framework allows reading from any storage system that supports Apache Spark’s data sources and writing to Delta Lake, which stores data in Parquet format. Refer to [3] for the technical details.
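As a minimal sketch of this flow (the paths are hypothetical, and the session assumes the delta-spark package is installed):

```python
from pyspark.sql import SparkSession

# Configure a Spark session with the Delta Lake extension
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read from any Spark-supported source (JSON here) ...
events = spark.read.json("/data/landing/events")

# ... and write it out as a Delta table: Parquet files plus a transaction log
events.write.format("delta").mode("overwrite").save("/data/delta/events")
```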
ACID transactions provide the key “time travel” feature of Delta Lake that allows exploring the data at a particular point in time, enabling access and rollback to earlier versions of data - critical for audits and model prediction reproducibility.
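For illustration, time travel can be exercised with the versionAsOf / timestampAsOf read options; the path, version, and timestamp below are hypothetical, and RESTORE requires a recent Delta Lake release:

```python
# Read the table as it was at an earlier version or timestamp
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/delta/events")
jan = (spark.read.format("delta")
       .option("timestampAsOf", "2021-01-01")
       .load("/data/delta/events"))

# Roll the live table back to an earlier version
spark.sql("RESTORE TABLE delta.`/data/delta/events` TO VERSION AS OF 0")
```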
Alternate data platforms / frameworks that can be leveraged here include Apache Hudi and Apache Iceberg. Hudi follows the conventional "transaction log" based approach, with timestamped data files and log files that track changes to the records in data files. Iceberg provides ACID support using a metadata hierarchy: a table metadata file records the schema and the list of snapshots; each snapshot points to a manifest list; the manifest list references manifest files; and each manifest file tracks a subset of the data files along with per-file statistics.
Table writes create a new snapshot, which can proceed in parallel with concurrent queries (which continue to read the last committed snapshot). Concurrent writes follow an optimistic concurrency control protocol: the first transaction to commit succeeds, and all other conflicting concurrent transactions are restarted.
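As a sketch, Iceberg exposes this hierarchy through metadata tables, so snapshots can be inspected and queried directly; this assumes a Spark session already configured with an Iceberg catalog named demo, and the table name and snapshot id are illustrative:

```python
# List the snapshots recorded for a table
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots"
).show()

# Read the table as of a specific snapshot
old = (spark.read.format("iceberg")
       .option("snapshot-id", 5934817123037187000)
       .load("demo.db.events"))
```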
We show how ACID transactions play a key role in Data Historization.
Data Historization
Historization is the process of keeping track of data changes over time. It is applicable at all layers of the data platform (Fig. 2): source data ingested into 'Raw' (Bronze), to cleansed data in 'Refined' (Silver), to transformed data in 'Curated' (Gold).
There are primarily two data historization methods: full snapshots, where a complete copy of the data is persisted at every load, and Slowly Changing Dimensions Type 2 (SCD2), where each change is captured as a new record version with validity dates.
A hybrid approach combines the two:
History of snapshots for the Raw / Staging layer: Implement a persistent staging area that acts as a permanent archive of source data deliveries. This allows for loading and historizing source data before they have been explicitly modeled and serves as a kind of insurance against modeling mistakes in the Curated layer / Aggregate Data Store.
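A minimal sketch of such a persistent staging load, assuming Delta as the storage format (the paths and column names are illustrative):

```python
from pyspark.sql import functions as F

# Append each source delivery unchanged, stamped with load metadata
delivery = spark.read.parquet("/data/landing/orders/2021-06-01")
(delivery
 .withColumn("load_ts", F.current_timestamp())      # when the delivery was loaded
 .withColumn("source_file", F.input_file_name())    # which file it came from
 .write.format("delta")
 .mode("append")                # never overwrite: the staging area is an archive
 .save("/data/raw/orders"))
```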
Leverage SCD2 for the Curated layer / Aggregate Data Store:
SCD2 remains the most popular historization technique for the Curated layer. When a new record version arrives, the previous version needs to be expired. This consists of multiple operations on the dimension table (see the merge sketch below): the incoming row is inserted as the new current version with an open end date, and the existing current row is updated to set its end date and clear its current flag.
Delta Lake / Hudi / Iceberg, with their support for ACID transactions, make it possible to perform such Upsert (Merge) operations in a reliable and scalable fashion.
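The following sketch shows the classic SCD2 merge pattern using Delta Lake's MERGE API; the table paths and the customer_id / address / effective_date / end_date / is_current columns are all illustrative:

```python
from delta.tables import DeltaTable

# Target dimension table and the incoming batch of source rows (illustrative)
customers = DeltaTable.forPath(spark, "/data/curated/customers")
updates = spark.read.format("delta").load("/data/refined/customer_updates")

# New versions for customers whose tracked attribute changed; a NULL mergeKey
# guarantees these rows never match and are therefore always inserted
changed = (updates.alias("u")
           .join(customers.toDF().alias("c"), "customer_id")
           .where("c.is_current = true AND u.address <> c.address")
           .selectExpr("NULL AS mergeKey", "u.*"))

# All incoming rows keyed on customer_id; matches expire the current version
staged = changed.union(updates.selectExpr("customer_id AS mergeKey", "*"))

(customers.alias("c")
 .merge(staged.alias("s"), "c.customer_id = s.mergeKey")
 .whenMatchedUpdate(                      # expire the old version
     condition="c.is_current = true AND c.address <> s.address",
     set={"is_current": "false", "end_date": "s.effective_date"})
 .whenNotMatchedInsert(                   # insert the new version
     values={"customer_id": "s.customer_id",
             "address": "s.address",
             "effective_date": "s.effective_date",
             "end_date": "null",
             "is_current": "true"})
 .execute())
```

Because the expire and insert happen inside a single atomic MERGE, concurrent readers never observe a customer with zero or two current versions.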
References