An Iceberg Table’s Architecture
Arabinda Mohapatra
Pyspark, SnowFlake,AWS, Stored Procedure, Hadoop,Python,SQL,Airflow,Kakfa,IceBerg,DeltaLake,HIVE,BFSI,Telecom
There are 3 layers in the architecture of an Iceberg table:
2. The metadata layer,
which contains metadata files, manifest lists, and manifest files
Manifest Files
Manifest files keep track of files in the data layer along with the additional details and statistics about each file. It stores all this information in avro file format.
Manifest Lists
Manifest lists keep track of manifest files, including the location, the partitions it belong to, and the upper and lower bound for partition columns for the data it tracks. It stores all this information in avro file format.
A Manifest list file is a snapshot of an Iceberg Table as it contains the details of the snapshot along with snapshot_id that has added it.
Metadata Files
Metadata files keep track of Manifest Lists. These files include the information about the metadata of the Iceberg Table at a certain point in time i.e. table’s schema, partition information, snapshots, and which snapshot is the current one. All this information is stored in a json format file.
The data layer:
CREATE TABLE
First, let’s create a table in our environment.
CREATE TABLE table1 (
order_id BIGINT,
customer_id BIGINT,
order_amount DECIMAL(10, 2),
order_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY ( HOUR(order_ts) );
INSERT
Now, let’s add some data to the table (albeit, literal values).
INSERT INTO table1 VALUES (
123,
456,
36.17,
'2021-01-26 08:10:23'
);
领英推荐
During all of these steps, anyone reading the table would continue to read the first metadata file until the atomic step #5 is complete, meaning that no one using the data would ever see an inconsistent view of the table’s state and contents.
MERGE INTO / UPSERT
In this example, the stage table includes an update for the order that’s already in the table (order_id=123) and a new order that isn’t in the table yet, which occurred on January 27, 2021 at 10:21:46.
MERGE INTO table1 AS t
USING (SELECT * FROM table1_stage) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
UPDATE SET t.order_amount = s.order_amount
WHEN NOT MATCHED THEN
INSERT (order_id, order_amount)
VALUES (s.order_id, s.order_amount);
Read Path:
Copy-on-Write:
Merge-on-Read (Upcoming):While not yet implemented, this future Iceberg strategy will provide a different approach to updates, but the current method still uses copy-on-write.
New Data Files:
Manifest Files
Manifest List:
Metadata Updates:
SELECT
Let’s review the SELECT path again, but this time on the Iceberg table we’ve been working on
When this SELECT statement is executed, the following process happens:
Refer: