Delta Lake, Iceberg & Hudi: A Transactional Perspective

Abstract. Transactions with their ACID guarantees used to be the backbone of Database Management Systems. With the arrival of Streaming and NoSQL, however, transactions were considered too strict and difficult to implement for Big Data platforms. Eventual consistency became the norm for such platforms: individual distributed nodes may temporarily be inconsistent and return different values, with all nodes converging to the same value at a later point in time.

However, Big Data platforms and frameworks have now matured to the point that we are seeing a resurgence of platforms providing ACID support, e.g., Delta Lake, Hudi, and Iceberg. In this article, we make the case for transactions in a Big Data world, providing the necessary background on how transactions can be implemented in such scenarios. We then show a concrete application: how ACID transactions play a key role in enabling Data Historization.

Transactions

A transaction [1] can be considered a group of operations, encapsulated by the operations Begin and Commit/Abort, with the following properties (ACID):

  • Atomicity: Either all the operations are executed or none of them are executed. In case of failure (abort), the effects of any operation belonging to the transaction are rolled back.
  • Consistency: Each transaction moves the system from one consistent state to another.
  • Isolation: To improve performance, often several transactions are executed concurrently. Isolation necessitates that the effects of such concurrent execution are equivalent to that of a serial execution. This is achieved by ensuring that the intermediate results of a transaction are not externalized until it completes successfully (commits).
  • Durability: Once a transaction commits, its effects are durable, i.e., they should not be destroyed by any system or software crash.

For example, let us consider the classic bank transaction t, which involves transferring money from an account A to another account B. The transaction consists of two operations: the first withdraws money from account A and the second deposits it into account B. Needless to say, any partial execution of the transaction would result in an inconsistent state. The atomicity property ensures that either both the withdraw and deposit operations succeed or both fail. The isolation property ensures that the changes to both accounts A and B are not visible to other transactions until t commits. The atomicity and isolation properties together ensure the consistency of the system (accounts A and B).
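
To make this concrete, below is a minimal sketch of the transfer using Python's built-in sqlite3 module; the accounts table and the account ids are hypothetical. The withdraw and the deposit either commit together or are rolled back together.

```python
import sqlite3

# Toy setup: an in-memory database with a hypothetical accounts table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 500.0), ("B", 100.0)])
conn.commit()

def transfer(conn, from_acct, to_acct, amount):
    # 'with conn' opens a transaction: it commits if the block succeeds
    # and rolls back if any statement raises, so the withdraw and the
    # deposit are applied atomically (all or nothing).
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, from_acct))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, to_acct))

transfer(conn, "A", "B", 100.0)
print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
# [('A', 400.0), ('B', 200.0)]
```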

Transactions in a Big Data context

Delta Lake is an open-source framework that brings ACID transactions to Apache Spark and big data workloads. One can download open-source Delta Lake and use it on-prem with HDFS. The framework allows reading from any storage system that supports Apache Spark’s data sources and writing to Delta Lake, which stores data in Parquet format. Refer to [3] for the technical details.
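
As an illustration, here is a minimal PySpark sketch, assuming the delta-spark package and its jars are available; the table path is hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes delta-spark is installed (e.g., pip install delta-spark) and its jars are on the classpath
spark = (SparkSession.builder
         .appName("delta-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Each write is an ACID transaction recorded in the table's _delta_log;
# the data itself is stored as Parquet files under the table path.
df.write.format("delta").mode("overwrite").save("/tmp/demo/users")

spark.read.format("delta").load("/tmp/demo/users").show()
```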

ACID transactions provide the key “time travel” feature of Delta Lake, which allows exploring the data at a particular point in time and enables access and rollback to earlier versions of the data, critical for audits and for reproducing model predictions.
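
For example, a minimal sketch of Delta Lake time travel, continuing the session from the sketch above; the version number and timestamp are illustrative.

```python
# 'spark' is the SparkSession configured for Delta Lake in the previous sketch

# Read the table as of an earlier version number recorded in the transaction log
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo/users")

# ...or as of a point in time (illustrative timestamp)
old = (spark.read.format("delta")
       .option("timestampAsOf", "2024-01-01 00:00:00")
       .load("/tmp/demo/users"))

# The table's commit history (version, timestamp, operation for each commit)
spark.sql("DESCRIBE HISTORY delta.`/tmp/demo/users`").show(truncate=False)
```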

Alternate data platforms and frameworks that can be leveraged here include Apache Hudi and Apache Iceberg. Hudi follows the conventional “transaction log” based approach, with timestamped data files and log files that track changes to the records in the data files. Iceberg provides ACID support using the following metadata hierarchy:

  • table ‘metadata files’
  • ‘manifest lists’ that correspond to a snapshot of the table
  • ‘manifests’ that define groups of data files that can be part of multiple snapshots

Table writes create a new snapshot, which can proceed in parallel with concurrent queries (which return the last committed snapshot). Concurrent writes follow an optimistic concurrency control protocol: the first transaction to commit succeeds, and all other conflicting concurrent transactions are restarted.
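
As an illustration of the protocol (a toy sketch, not Iceberg's actual implementation), the following self-contained Python example models a table as an immutable snapshot plus a version counter: a writer prepares a new snapshot against the version it read, the commit succeeds only if that version is still the current one, and conflicting writers retry.

```python
import threading

class Table:
    """Toy table whose state is an immutable snapshot plus a version counter."""
    def __init__(self):
        self._lock = threading.Lock()   # only protects the pointer swap
        self.version = 0
        self.snapshot = {}              # e.g., {key: value}

    def read(self):
        # Readers always see the last committed snapshot.
        return self.version, dict(self.snapshot)

    def commit(self, base_version, new_snapshot):
        # Atomic compare-and-swap of the current-snapshot pointer.
        with self._lock:
            if base_version != self.version:
                return False            # conflict: another writer committed first
            self.snapshot = new_snapshot
            self.version += 1
            return True

def write(table, key, value):
    # Optimistic writer: retry the whole transaction on conflict.
    while True:
        base, snap = table.read()
        snap[key] = value               # prepare a new snapshot
        if table.commit(base, snap):
            return

t = Table()
write(t, "a", 1)
write(t, "b", 2)
print(t.read())  # (2, {'a': 1, 'b': 2})
```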

In the following, we show how ACID transactions play a key role in Data Historization.

Data Historization

Historization is the process of keeping track of data changes over time. It is applicable at all layers of the data platform (Fig. 2): from source data ingested into the ‘Raw’ (Bronze) layer, to cleansed data in the ‘Refined’ (Silver) layer, to transformed data in the ‘Curated’ (Gold) layer.

There are primarily two data historization methods:

  • Slowly Changing Dimensions Type 2 (SCD2): a history of changes, where a new record is added for an identifier every time one or more of its column values change. Refer to [4, 5] for details of implementing SCD2 on the AWS and Oracle Cloud platforms respectively.
  • History of snapshots: a new record is added for an identifier every time a data delivery arrives from the source system, whether it contains a change or not. This approach is gaining traction in the context of Big Data given the low cost of storage (especially for the Raw and Staging layers), the ease of implementation, and the robustness against late-arriving data.

Hybrid approach:

History of snapshots for the Raw / Staging layer: Implement a persistent staging area that acts as a permanent archive of source data deliveries. This allows for loading and historizing source data before they have been explicitly modeled and serves as a kind of insurance against modeling mistakes in the Curated layer / Aggregate Data Store.
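
A minimal sketch of this snapshot-based historization with PySpark and Delta Lake follows; the paths, the load_date column, and the delivery date are hypothetical. Each source delivery is appended as-is, tagged with its load date, so any past delivery can be reconstructed later.

```python
from pyspark.sql import SparkSession, functions as F

# 'spark' is assumed to be configured for Delta Lake, as in the earlier sketches
spark = SparkSession.builder.getOrCreate()

# A hypothetical incoming delivery from the source system
delivery = spark.read.parquet("/landing/customers/2024-06-01/")

# Append the full delivery to the Raw/Staging layer, tagged with its load date,
# regardless of whether anything changed since the previous delivery.
(delivery
 .withColumn("load_date", F.lit("2024-06-01"))
 .write.format("delta")
 .mode("append")
 .partitionBy("load_date")
 .save("/raw/customers"))

# Any past delivery can be reconstructed by filtering on load_date
snap = (spark.read.format("delta")
        .load("/raw/customers")
        .where("load_date = '2024-06-01'"))
```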

Leverage SCD2 for the Curated layer / Aggregate Data Store:

SCD2 remains the most popular historization technique for the Curated layer. When a new record is created, the previous record needs to be expired. This consists of multiple operations (refer to the sample table below for illustration):

  • Find the record in the current dataset, set eff_end_date of the existing record to the eff_start_date of the new record, and set the is_current flag of the existing record to ‘false’.
  • Insert the new record.

Delta Lake, Hudi, and Iceberg, with their support for ACID transactions, make it possible to perform such Upsert (Merge) operations in a reliable and scalable fashion.
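
As an illustration, below is a simplified sketch of the expire-then-insert steps above using Delta Lake's Python merge API; the table paths and the customer_id key are hypothetical, while eff_start_date, eff_end_date, and is_current are the columns from the list above. See [4, 5] for fuller SCD2 implementations.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# 'spark' is a SparkSession configured for Delta Lake, as in the earlier sketches
curated = DeltaTable.forPath(spark, "/curated/customers")            # existing SCD2 table
updates = spark.read.format("delta").load("/refined/customer_changes")  # new record versions

# Step 1: expire the currently active record for every changed key
(curated.alias("t")
 .merge(updates.alias("u"), "t.customer_id = u.customer_id AND t.is_current = true")
 .whenMatchedUpdate(set={
     "eff_end_date": "u.eff_start_date",   # close the old record
     "is_current": "false",
 })
 .execute())

# Step 2: insert the new record versions as the current ones
(updates
 .withColumn("eff_end_date", F.lit(None).cast("date"))
 .withColumn("is_current", F.lit(True))
 .write.format("delta").mode("append").save("/curated/customers"))
```

Because each merge and append is an ACID transaction, concurrent readers never observe a state where the old record has been expired but the new record has not yet been inserted and committed.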

References

  1. D. Biswas, K. Vidyasankar. Secure Cloud Transactions. Comput. Syst. Sci. Eng. 28(6) (2013). https://www.researchgate.net/publication/256309315_Secure_Cloud_Transactions
  2. D. Biswas. Compensation in the World of Web Services Composition. In: Semantic Web Services and Web Process Composition. SWSWPC 2004. Lecture Notes in Computer Science, vol 3387. https://doi.org/10.1007/978-3-540-30581-1_7
  3. M. Armbrust et al. Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proc. VLDB Endow. 13, 12 (August 2020), 3411–3424. https://doi.org/10.14778/3415478.3415560
  4. D. Greenshtein. Build Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and Apache Hudi on Amazon EMR, 2021, https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/
  5. A. Duvuri. Slowly Changing Dimensions (SCD) Type 2 Implementation in Oracle Cloud Infrastructure (OCI) Data Integration, 2020, https://blogs.oracle.com/dataintegration/post/slowly-changing-dimensions-scd-type-2-implementation-in-oracle-cloud-infrastructure-oci-data-integration

