ICEBERG (The new table format that is rocking the Big Data World)

Recently I have been working alongside clients that use relational databases and are moving towards Big Data technologies and data strategies, looking towards a future state of artificial intelligence and machine learning implementations. These clients have huge datasets that need to be processed quickly and efficiently. I have worked with various tech stacks and file formats, but wow, this one is a game changer and can keep up with the new tech stacks to provide quick results.

Not only can this format reduce query execution times up to tenfold; the combination of data compression, indexing, and chunking makes the Iceberg table format a unique and efficient way to store and access large amounts of data. It also gives you the ability to use time travel to visualise data as it existed in the past.

Brief Overview of Apache Iceberg

Apache Iceberg is an open table format that is designed to address the challenges of storing and accessing large amounts of data in a way that is efficient, scalable, and easy to use. It is intended to be a more powerful and flexible alternative to the traditional de facto standard for data storage and management, the Hive table format.

One of the key benefits of the Apache Iceberg format is its ability to handle large amounts of data. I'm talking huge datasets. It achieves this through the use of techniques such as data compression, indexing, and chunking. Data compression reduces the size of a file by eliminating redundancy and storing the data more efficiently. Indexing creates a list of pointers to specific locations within a file, allowing the data to be accessed quickly and efficiently. Chunking divides a large file into smaller pieces, or chunks, which can be accessed and processed independently.

In addition to its efficiency and scalability, the Apache Iceberg format also offers improved data governance and security. It allows for the tracking of data changes and provides support for data versioning and data lineage, making it easier to understand the history and provenance of data.

The Apache Iceberg format is also well-suited to data lake architectures, which are designed to store and manage large amounts of data from a variety of sources. It provides a flexible and efficient way to store and access data in a data lake, enabling more people and tools to interact with the data and extract value from it.

Overall, the Apache Iceberg table format is a powerful and flexible tool for storing and accessing large amounts of data in a way that is efficient, scalable, and easy to use. It is well-suited to a variety of applications, including data lake management, data storage, and data analysis.

Time Travel is here - a governance dream machine.

Another key capability the Iceberg table format enables is something called “time travel.”

To keep track of the state of a table over time for compliance, reporting, or reproducibility purposes, data engineering teams traditionally need to write and manage jobs that create and maintain copies of the table at certain points in time.

Instead, Iceberg provides the ability out of the box to see what a table looked like at different points in time in the past. A governance dream!
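
To see this in action, recent Trino versions let you query an Iceberg table as of a past point in time or a specific snapshot. A minimal sketch (the table name, timestamp, and snapshot id are placeholders):

-- Query the table as it existed at a given point in time
SELECT * FROM test_table FOR TIMESTAMP AS OF TIMESTAMP '2023-01-01 00:00:00 UTC';

-- Or pin the query to a specific snapshot id
SELECT * FROM test_table FOR VERSION AS OF 8954597067493422955;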


Let's go deeper into Iceberg

There are 3 layers in the architecture of an Iceberg table:

1. The Iceberg catalog – Trino can be used here.

2. The metadata layer, which contains metadata files, manifest lists, and manifest files.

3. The data layer – ORC, Parquet, etc.

Iceberg catalog – HDFS (version-hint.text), Hive Metastore

The main requirement for an Iceberg catalog is that it must support atomic operations for updating the current metadata pointer (e.g., HDFS, Hive Metastore, GCS). This is what allows transactions on Iceberg tables to be atomic.

Metadata file

As the name implies, metadata files store metadata about a table. This includes information about the table's schema, partition information, snapshots, and which snapshot is the current one, as well as versioning information.

Manifest list

Another aptly named file, the manifest list is a list of manifest files. It has information about each manifest file that makes up a snapshot, such as the location of the manifest file, which snapshot it was added as part of, and information about the partitions it belongs to, including the lower and upper bounds for partition columns of the data files it tracks.

Data Files

The underlying physical data files in your chosen format within storage: ORC, Parquet, or Avro.
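
If you want to peek at these layers without leaving SQL, Trino's Iceberg connector exposes hidden metadata tables alongside each table. A minimal sketch (test_table is a placeholder):

-- Snapshots recorded in the metadata file, including when each was committed
SELECT snapshot_id, committed_at, operation FROM "test_table$snapshots";

-- The manifest files that make up the table's snapshots
SELECT path, added_snapshot_id FROM "test_table$manifests";

-- The underlying data files and their sizes
SELECT file_path, record_count, file_size_in_bytes FROM "test_table$files";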

We have been implementing the Iceberg format with ORC as the underlying data file format, either using Spark, the new Hive 4.0, or Trino.

Trino offers an easy way of implementing the Iceberg format and only requires a Hive Metastore (HMS) and a storage layer – GCS, S3, or HDFS. The commands all remain very straightforward, as documented in the technical snippets below.
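
As a rough illustration of how little setup this needs, a Trino catalog file along these lines is enough (the metastore host is hypothetical; check the Trino Iceberg connector docs for your version and storage layer):

# etc/catalog/iceberg.properties – minimal sketch
connector.name=iceberg
hive.metastore.uri=thrift://metastore-host:9083
iceberg.file-format=ORC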


Geek Time Below

Technical Snippets to use Iceberg within Trino

Create a Schema

CREATE SCHEMA iceberg.my_gcs_schema
WITH (location = 'gcs://my-bucket/a/path/');

Create a table that’s partitioned as below if required

CREATE TABLE my_table (
    c1 integer,
    c2 date,
    c3 double
)
WITH (
    format = 'ORC',
    partitioning = ARRAY['c1', 'c2'],
    location = 'gcs://my-bucket/a/path/'
);

After that, simple INSERT, UPDATE, and DELETE commands can be used.
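
For example, against the table created above (the values are illustrative; note that row-level UPDATE and DELETE require v2 of the Iceberg specification, covered below):

INSERT INTO my_table VALUES (1, DATE '2023-01-01', 10.5);

UPDATE my_table SET c3 = 11.0 WHERE c1 = 1;

DELETE FROM my_table WHERE c2 < DATE '2023-01-01';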

Optimize Iceberg data files

The optimize command rewrites the active content of the specified table so that it is merged into fewer but larger files. If the table is partitioned, the data compaction acts separately on each partition selected for optimization. This operation improves read performance.


ALTER TABLE test_table EXECUTE optimize(file_size_threshold => '10MB')        

Expiring Snapshots (removes time-travel history, for example to reduce storage costs)

The expire_snapshots command removes all snapshots and all related metadata and data files. Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small. The procedure affects all snapshots that are older than the time period configured with the retention_threshold parameter.

ALTER TABLE test_table EXECUTE expire_snapshots(retention_threshold => '7d')

Iceberg Specifications.

For example, to update a table from v1 of the Iceberg specification to v2:

ALTER TABLE table_name SET PROPERTIES format_version = 2;        

Row-level deletions

Tables using v2 of the Iceberg specification support deletion of individual rows by writing position delete files.

Iceberg Data Management

Includes support for INSERT, UPDATE, DELETE, and MERGE.
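
As a quick sketch of MERGE, upserting changes from a hypothetical staging table (my_table_updates) into the table created earlier:

MERGE INTO my_table t
USING my_table_updates u
ON t.c1 = u.c1
WHEN MATCHED THEN
    UPDATE SET c3 = u.c3
WHEN NOT MATCHED THEN
    INSERT (c1, c2, c3) VALUES (u.c1, u.c2, u.c3);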


Don't be afraid to experiment and implement new tech. It will make your business respond to the ever-changing software industry and gear your organisation towards AI/ML and everything in between. Reach out to DELIVER BI for any data strategy implementation questions and guidance, or to architect a future-state data lake, warehouse, or business intelligence system. Upgrades and support are also provided.
