ICEBERG (The new table format that is rocking the Big Data World)

Recently I have been working alongside clients that use relational databases and are moving towards Big Data technologies and data strategies, looking towards a future state of artificial intelligence and machine learning implementations. These clients have huge datasets that need to be processed quickly and efficiently. I have worked with various tech stacks and file formats, but wow, this one is a game changer and can keep up with the new tech stacks to provide quick results.

Not only can this format reduce query execution times up to tenfold; the combination of data compression, indexing, and chunking makes the Iceberg table format a unique and efficient way to store and access large amounts of data. It also gives you the ability to use time travel to visualise data as it existed in the past.

Brief Overview of Apache Iceberg

Apache Iceberg is an open table format that is designed to address the challenges of storing and accessing large amounts of data in a way that is efficient, scalable, and easy to use. It is intended to be a more powerful and flexible alternative to the traditional de facto standard for data storage and management, the Hive table format.

One of the key benefits of the Apache Iceberg format is its ability to handle large amounts of data. I'm talking huge datasets. It achieves this through the use of techniques such as data compression, indexing, and chunking. Data compression reduces the size of a file by eliminating redundancy and storing the data more efficiently. Indexing creates a list of pointers to specific locations within a file, allowing the data to be accessed quickly and efficiently. Chunking divides a large file into smaller pieces, or chunks, which can be accessed and processed independently.

In addition to its efficiency and scalability, the Apache Iceberg format also offers improved data governance and security. It allows for the tracking of data changes and provides support for data versioning and data lineage, making it easier to understand the history and provenance of data.

The Apache Iceberg format is also well-suited to data lake architectures, which are designed to store and manage large amounts of data from a variety of sources. It provides a flexible and efficient way to store and access data in a data lake, enabling more people and tools to interact with the data and extract value from it.

Overall, the Apache Iceberg table format is a powerful and flexible tool for storing and accessing large amounts of data in a way that is efficient, scalable, and easy to use. It is well-suited to a variety of applications, including data lake management, data storage, and data analysis.

Time Travel is here - a governance dream machine.

Another key capability the Iceberg table format enables is something called “time travel.”

To keep track of the state of a table over time for compliance, reporting, or reproducibility purposes, data engineering teams traditionally need to write and manage jobs that create and maintain copies of the table at certain points in time.

Instead, Iceberg provides the ability out of the box to see what a table looked like at different points in time in the past. A governance dream!
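
To see this in action, recent Trino versions let you query an Iceberg table as of a past point in time or a specific snapshot. A minimal sketch (the table name, timestamp, and snapshot id are placeholders):

-- Query the table as it existed at a given point in time
SELECT * FROM test_table FOR TIMESTAMP AS OF TIMESTAMP '2023-01-01 00:00:00 UTC';

-- Or pin the query to a specific snapshot id
SELECT * FROM test_table FOR VERSION AS OF 8954597067493422955;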


Let's go deeper into Iceberg

There are 3 layers in the architecture of an Iceberg table:

1. The Iceberg catalog – Trino can be used here.

2. The metadata layer, which contains metadata files, manifest lists, and manifest files.

3. The data layer – ORC, Parquet, etc.

Iceberg catalog – HDFS (version-hint.text), Hive Metastore

The main requirement for an Iceberg catalog is that it must support atomic operations for updating the current metadata pointer (e.g., HDFS, Hive Metastore, GCS). This is what allows transactions on Iceberg tables to be atomic.

Metadata file

As the name implies, metadata files store metadata about a table. This includes information about the table's schema, partition information, snapshots, and which snapshot is the current one, as well as versioning information.

Manifest list

Another aptly named file, the manifest list is a list of manifest files. It has information about each manifest file that makes up a snapshot, such as the location of the manifest file, which snapshot it was added as part of, and information about the partitions it belongs to, including the lower and upper bounds for partition columns of the data files it tracks.

Data Files

The underlying physical data files in your chosen format within storage: ORC, Parquet, or Avro.
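
If you want to peek at these layers without leaving SQL, Trino's Iceberg connector exposes hidden metadata tables alongside each table. A minimal sketch (test_table is a placeholder):

-- Snapshots recorded in the metadata file, including when each was committed
SELECT snapshot_id, committed_at, operation FROM "test_table$snapshots";

-- The manifest files that make up the table's snapshots
SELECT path, added_snapshot_id FROM "test_table$manifests";

-- The underlying data files and their sizes
SELECT file_path, record_count, file_size_in_bytes FROM "test_table$files";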

We have been implementing the Iceberg format with ORC as the underlying data file format, either using Spark, the new Hive 4.0, or Trino.

Trino offers an easy way of implementing the Iceberg format and only requires a Hive Metastore (HMS) and a storage layer – GCS, S3, or HDFS. The commands all remain very straightforward, as documented in the technical snippets below.
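
As a rough illustration of how little setup this needs, a Trino catalog file along these lines is enough (the metastore host is hypothetical; check the Trino Iceberg connector docs for your version and storage layer):

# etc/catalog/iceberg.properties – minimal sketch
connector.name=iceberg
hive.metastore.uri=thrift://metastore-host:9083
iceberg.file-format=ORC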


Geek Time Below

Technical Snippets to use Iceberg within Trino

Create a Schema

CREATE SCHEMA iceberg.my_gcs_schema
WITH (location = 'gcs://my-bucket/a/path/');

Create a table that’s partitioned as below if required

CREATE TABLE my_table (
    c1 integer,
    c2 date,
    c3 double
)
WITH (
    format = 'ORC',
    partitioning = ARRAY['c1', 'c2'],
    location = 'gcs://my-bucket/a/path/'
);

After that, simple INSERT, UPDATE, and DELETE commands can be used.
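
For example, against the table created above (the values are illustrative; note that row-level UPDATE and DELETE require v2 of the Iceberg specification, covered below):

INSERT INTO my_table VALUES (1, DATE '2023-01-01', 10.5);

UPDATE my_table SET c3 = 11.0 WHERE c1 = 1;

DELETE FROM my_table WHERE c2 < DATE '2023-01-01';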

Optimize Iceberg data files

The optimize command rewrites the active content of the specified table so that it is merged into fewer but larger files. If the table is partitioned, the data compaction acts separately on each partition selected for optimization. This operation improves read performance.


ALTER TABLE test_table EXECUTE optimize(file_size_threshold => '10MB')        

Expiring Snapshots (removes time-travel history, for example to reduce storage costs)

The expire_snapshots command removes all snapshots and all related metadata and data files. Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small. The procedure affects all snapshots that are older than the time period configured with the retention_threshold parameter.

ALTER TABLE test_table EXECUTE expire_snapshots(retention_threshold => '7d')

Iceberg Specifications.

For example, to update a table from v1 of the Iceberg specification to v2:

ALTER TABLE table_name SET PROPERTIES format_version = 2;        

Row-level deletions

Tables using v2 of the Iceberg specification support deletion of individual rows by writing position delete files.

Iceberg Data Management

Includes support for INSERT, UPDATE, DELETE, and MERGE.
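
As a quick sketch of MERGE, upserting changes from a hypothetical staging table (my_table_updates) into the table created earlier:

MERGE INTO my_table t
USING my_table_updates u
ON t.c1 = u.c1
WHEN MATCHED THEN
    UPDATE SET c3 = u.c3
WHEN NOT MATCHED THEN
    INSERT (c1, c2, c3) VALUES (u.c1, u.c2, u.c3);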


Don't be afraid to experiment and implement new tech. It will make your business respond to the ever-changing software industry and gear your organisation towards AI/ML and everything in between. Reach out to DELIVER BI for any data strategy implementation questions and guidance, or to architect a future-state data lake, warehouse, or business intelligence system. Upgrades and support are also provided.
