An Iceberg Table’s Architecture

There are 3 layers in the architecture of an Iceberg table:

  1. The Iceberg catalog

  • Within the catalog layer, there is a reference (or pointer) that points to the current metadata file for each table.
  • Because the catalog is only an interface, and its sole requirements are to store the current metadata pointer and provide atomic guarantees when updating it, many different backends can serve as the Iceberg catalog: Hadoop, AWS S3, Hive Metastore, AWS Glue Data Catalog, and more. Each implementation stores the current metadata pointer differently.

  2. The metadata layer, which contains metadata files, manifest lists, and manifest files

  • The files in this layer are immutable, so every time an insert, merge, upsert, or delete operation happens on the table, a new set of files is written.

Manifest Files

Manifest files keep track of the files in the data layer, along with additional details and statistics about each data file, such as row counts and per-column value bounds. All of this information is stored in the Avro file format.
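
Engines such as Spark expose this information through Iceberg’s metadata tables. A minimal sketch, assuming the db1.table1 table created later in this article:

-- One row per manifest file, with stats such as how many data files it added
SELECT path, length, added_data_files_count, existing_data_files_count
FROM db1.table1.manifests;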

Manifest Lists

Manifest lists keep track of manifest files, including each manifest’s location, the partitions it belongs to, and the upper and lower bounds of the partition columns for the data it tracks. All of this information is stored in the Avro file format.

A manifest list file effectively represents a snapshot of an Iceberg table, as it contains the details of that snapshot along with the snapshot_id of the snapshot that added each manifest.
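
That link between a snapshot and its manifest list is visible in the snapshots metadata table. A minimal sketch, again assuming the db1.table1 table used later on:

-- One row per snapshot; manifest_list points to that snapshot’s manifest list file
SELECT snapshot_id, committed_at, operation, manifest_list
FROM db1.table1.snapshots;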

Metadata Files

Metadata files keep track of manifest lists. They capture the state of an Iceberg table at a certain point in time: the table’s schema, partition specification, snapshots, and which snapshot is the current one. All of this information is stored in a JSON file.
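
Newer Iceberg releases also expose the table’s metadata files through a metadata table (older versions only expose history). A minimal sketch:

-- One row per metadata file written for the table
SELECT timestamp, file, latest_snapshot_id
FROM db1.table1.metadata_log_entries;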

  3. The data layer

  • This is the layer where the actual data for the table is stored, and it is primarily made up of data files. Apache Iceberg is file-format agnostic; it currently supports Apache Parquet, Apache ORC, and Apache Avro, and stores data in Apache Parquet by default.
  • This file-format agnosticism gives users the ability to choose the underlying file format based on the use case; for example, Parquet might be used for a large-scale OLAP analytics table, whereas Avro might be used for a low-latency streaming analytics table (see the configuration sketch below).
  • The data layer is backed by a distributed file system like HDFS or a cloud object storage like AWS S3. This enables building data lakehouse architectures that can benefit from these extremely scalable and low-cost storage systems.
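
As a sketch of how that per-table choice is made, assuming Spark SQL and Iceberg’s write.format.default table property (which controls the format of newly written data files) on a table like the db1.table1 created below:

-- Newly written data files for this table will use Avro instead of the default Parquet
ALTER TABLE db1.table1 SET TBLPROPERTIES ('write.format.default' = 'avro');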


https://www.dremio.com/wp-content/uploads/2023/04/catalog-1536x781.png


Iceberg Catalog

  • The central place you go to find the current location of the current metadata pointer is the Iceberg catalog.
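
How a query engine finds that catalog is an engine-level configuration. As a minimal, hypothetical sketch for Spark (the catalog name lakehouse and the metastore URI are made-up placeholders; a Hadoop, Glue, or REST catalog would use different properties):

# spark-defaults.conf (illustrative only)
spark.sql.catalog.lakehouse        org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakehouse.type   hive
spark.sql.catalog.lakehouse.uri    thrift://metastore-host:9083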


CREATE TABLE

First, let’s create a table in our environment.

CREATE TABLE table1 (
    order_id BIGINT,
    customer_id BIGINT,
    order_amount DECIMAL(10, 2),
    order_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY ( HOUR(order_ts) );


https://www.dremio.com/wp-content/uploads/2021/06/create-table-1-1536x796.png


  • Above, we created a table called table1 in database db1. The table has four columns and is partitioned at the hour granularity of the order_ts timestamp column (more on that later).
  • When the query above is executed, a metadata file with a snapshot s0 is created in the metadata layer.
  • Snapshot s0 doesn’t point to any manifest list because no data exists in the table yet.
  • The catalog entry for db1.table1’s current metadata pointer is then updated to point to the path of this new metadata file.
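
As a quick, optional check that the table was created with the iceberg provider and the expected hour partitioning (output layout varies by Spark version):

DESCRIBE TABLE EXTENDED db1.table1;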

INSERT

Now, let’s add some data to the table (albeit, literal values).

INSERT INTO table1 VALUES (
    123,
    456,
    36.17,
    '2021-01-26 08:10:23'
);


https://www.dremio.com/wp-content/uploads/2021/06/insert-1536x818.png


  1. The data in the form of a Parquet file is first created – table1/data/order_ts_hour=2021-01-26-08/00000-5-cae2d.parquet
  2. Then, a manifest file pointing to this data file is created (including the additional details and statistics) – table1/metadata/d8f9-ad19-4e.avro
  3. Then, a manifest list pointing to this manifest file is created (including the additional details and statistics) – table1/metadata/snap-2938-1-4103.avro
  4. Then, a new metadata file is created based on the previously current metadata file. It adds a new snapshot s1, pointing to this manifest list (including the additional details and statistics), while still keeping track of the previous snapshot s0 – table1/metadata/v2.metadata.json
  5. Then, the value of the current metadata pointer for db1.table1 is atomically updated in the catalog to now point to this new metadata file.

During all of these steps, anyone reading the table would continue to read the first metadata file until the atomic step #5 is complete, meaning that no one using the data would ever see an inconsistent view of the table’s state and contents.
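
The metadata tables introduced earlier now reflect exactly this sequence. A minimal sketch, assuming the same Spark setup as above (actual snapshot IDs and file names will differ):

-- Lists the table’s snapshots; the newest row is the INSERT (operation = 'append')
SELECT snapshot_id, parent_id, operation
FROM db1.table1.snapshots;

-- One row per live data file, including its partition and record count
SELECT file_path, partition, record_count
FROM db1.table1.files;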



MERGE INTO / UPSERT

In this example, the stage table includes an update for the order that’s already in the table (order_id=123) and a new order that isn’t in the table yet, which occurred on January 27, 2021 at 10:21:46.

MERGE INTO table1 AS t
USING (SELECT * FROM table1_stage) AS s
    ON t.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET t.order_amount = s.order_amount
WHEN NOT MATCHED THEN
    INSERT *;


https://www.dremio.com/wp-content/uploads/2023/04/merge-into-1536x855.png


Read Path:

  • Iceberg reads the relevant records from table1 and table1_stage that share the same order_id. For example, the record with order_id=123 from table1 is read from the file 00000-5-cae2d.parquet into the query engine’s memory.

Copy-on-Write:

  • The order_id=123 record’s order_amount is updated in memory. The entire data file containing that record, including any non-matching records, is then rewritten to a new Parquet file: table1/data/order_ts_hour=2021-01-26-08/00000-1-aef71.parquet.

Merge-on-Read:

  • Iceberg’s format v2 also supports a merge-on-read strategy, in which updates and deletes are written to delete files instead of rewriting entire data files. This walkthrough, however, uses copy-on-write (a configuration sketch follows the walkthrough).

New Data Files:

  • If a record from table1_stage doesn’t match any in table1, it is written to a new Parquet file in a different partition: table1/data/order_ts_hour=2021-01-27-10/00000-3-0fa3a.parquet.

Manifest Files:

  • A new manifest file is created to track these data files, including details and statistics: table1/metadata/0d9a-98fa-77.avro.

Manifest List:

  • A new manifest list is created, pointing to the newly created manifest file: table1/metadata/snap-9fa1-3-16c3.avro.

Metadata Updates:

  • A new metadata file is created based on the current metadata, pointing to the new snapshot s2 and retaining references to previous snapshots s0 and s1: table1/metadata/v3.metadata.json.
  • The current metadata pointer in the catalog for db1.table1 is atomically updated to point to this new metadata file.
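
Whether a table uses copy-on-write or merge-on-read for these operations is controlled per table through properties. A minimal sketch in Spark SQL, using Iceberg’s documented write-mode properties (merge-on-read requires table format version 2):

-- Switch UPDATE, DELETE, and MERGE on this table from copy-on-write to merge-on-read
ALTER TABLE db1.table1 SET TBLPROPERTIES (
    'format-version'    = '2',
    'write.update.mode' = 'merge-on-read',
    'write.delete.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read'
);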

SELECT

Let’s review the SELECT path again, but this time on the Iceberg table we’ve been working on.
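
The statement itself isn’t shown in the diagram; based on step 6 below, assume a simple full-table scan:

SELECT * FROM db1.table1;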


https://www.dremio.com/wp-content/uploads/2023/04/select-1536x758.png

When this SELECT statement is executed, the following process happens:

  1. The query engine goes to the Iceberg catalog
  2. It then retrieves the current metadata file location entry for db1.table1
  3. It then opens this metadata file and retrieves the entry for the manifest list location for the current snapshot, s2
  4. It then opens this manifest list, retrieving the location of the only manifest file
  5. It then opens this manifest file, retrieving the location of the two data files
  6. It then reads these data files, and since it’s a SELECT *, returns the data back to the client


Refer:

https://medium.com/@geekfrosty/apache-iceberg-architecture-demystified-e19b5cae9975

https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/


