Iceberg’s Icy Ascent: How Apache Iceberg Became the Table Format of the Future

Rakesh Gupta

Consulting | Data | GenAI | IoT | Edge Computing

Published: November 21, 2024

For years, the data engineering world wrestled with a critical question: which open table format would dominate the future? Would Delta Lake’s seamless Databricks integration prevail? Could Apache Hudi, with its streaming-first ethos, maintain an edge? Or would Apache Iceberg, quietly innovative, emerge victorious?

As of late 2024, the answer is clear: Apache Iceberg has won.

A Turning Point in 2024

Several key developments in 2024 sealed Iceberg’s position:

Databricks acquired Tabular, founded by Iceberg’s creators, solidifying Iceberg’s stature in the data ecosystem.
Snowflake unveiled Polaris, an Iceberg-based catalog, with support from major query engines like Starburst and Dremio.
GitHub activity showed Iceberg gaining significant traction, underscoring its growing developer base and widespread adoption.

Together, these moves signaled a decisive industry shift toward Iceberg as the de facto open table format. But Iceberg’s story is far from over. The upcoming advancements in 2025 promise to cement its dominance and expand its utility across diverse data workflows.

What’s Next for Iceberg in 2025?

1. RBAC Catalog: Simplifying Permissions at Scale

Data lake permissions have long been a challenge, often cobbled together with bucket-level rules or engine-specific controls. These fragmented methods are inefficient and prone to security gaps.

Iceberg’s new OpenAPI specification (PR #10722) changes the game. By standardizing credential structures, Iceberg introduces built-in Role-Based Access Control (RBAC) capabilities at the catalog level.

What it enables: Administrators can define fine-grained access policies that are independent of the underlying storage or query engine.
Why it matters: This rivals enterprise-grade solutions like Databricks Unity Catalog but maintains Iceberg’s hallmark openness and flexibility.
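
To make the idea concrete, here is a minimal sketch of a catalog-level permission check. It is not the Iceberg REST catalog API: the `Catalog` class, its `grant`, `assign`, and `is_allowed` methods, the role name, and the privilege strings are all hypothetical, chosen only to show how a policy can live in the catalog rather than in a storage bucket or a query engine.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy catalog that stores grants itself (hypothetical, not an Iceberg API)."""

    # (role, table identifier) -> set of privileges such as "SELECT" or "INSERT"
    grants: dict = field(default_factory=dict)
    # user -> set of roles
    user_roles: dict = field(default_factory=dict)

    def grant(self, role: str, table: str, privilege: str) -> None:
        self.grants.setdefault((role, table), set()).add(privilege)

    def assign(self, user: str, role: str) -> None:
        self.user_roles.setdefault(user, set()).add(role)

    def is_allowed(self, user: str, table: str, privilege: str) -> bool:
        # The decision depends only on catalog state, so every engine that
        # asks the catalog gets the same answer.
        return any(
            privilege in self.grants.get((role, table), set())
            for role in self.user_roles.get(user, set())
        )


catalog = Catalog()
catalog.grant("analyst", "sales.orders", "SELECT")
catalog.assign("dana", "analyst")

print(catalog.is_allowed("dana", "sales.orders", "SELECT"))  # True
print(catalog.is_allowed("dana", "sales.orders", "INSERT"))  # False
```

Because the check consults nothing but the catalog, the same policy applies whether the table is read through Trino, Spark, or any other engine that defers to the catalog.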

2. Change Data Capture (CDC): Iceberg’s Streaming Evolution

Historically, Iceberg wasn’t considered ideal for streaming due to limited CDC capabilities. While versioned table snapshots supported some CDC use cases, high-frequency data changes and real-time analytics were less efficient.

Enter Iceberg Spec V3, featuring Row Lineage.

What’s new: Row Lineage tracks individual row changes (updates, deletes, and inserts), enabling efficient CDC pipelines.
Why it matters: Materialized view maintenance and real-time data synchronization become far more seamless. Once fully implemented, Iceberg will rival streaming-first platforms like Kafka and Hudi for real-time applications.
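
The kind of changelog that stable row identifiers make possible can be sketched as follows. This is an illustration only: snapshots are modeled as plain dictionaries keyed by a row ID, and the `diff_snapshots` helper is hypothetical rather than part of any Iceberg library.

```python
def diff_snapshots(old: dict, new: dict) -> list:
    """Return (operation, row_id, row) tuples describing how old became new.

    Both arguments map a stable row ID to that row's values.
    """
    changes = []
    for row_id, row in new.items():
        if row_id not in old:
            changes.append(("insert", row_id, row))
        elif old[row_id] != row:
            changes.append(("update", row_id, row))
    for row_id, row in old.items():
        if row_id not in new:
            changes.append(("delete", row_id, row))
    return changes


snapshot_1 = {1: {"name": "ada", "tier": "free"}, 2: {"name": "bob", "tier": "free"}}
snapshot_2 = {1: {"name": "ada", "tier": "pro"}, 3: {"name": "cy", "tier": "free"}}

changes = diff_snapshots(snapshot_1, snapshot_2)
for change in changes:
    print(change)
```

Without a stable row ID, an update is indistinguishable from a delete followed by an insert, which is why row-level identity matters for CDC consumers.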

3. Materialized Views: Streamlining Derived Data

Derived datasets (aggregations, metrics, and other transformations) are critical for unlocking data value but have been cumbersome to manage with Iceberg.

A proposed materialized views feature (PR #11041) introduces built-in support for precomputed results stored as tables.

What it enables: Faster query performance and automatic updates when the base table changes.
Why it matters: It simplifies dependency tracking and reduces the overhead of managing derived data, opening up opportunities for systems like RisingWave to enhance the experience further.
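
One way to picture the freshness tracking is the sketch below. It assumes a view records the base table's snapshot ID at refresh time and recomputes when that ID changes. The `Table` and `MaterializedView` classes are hypothetical and are not meant to reflect the design in the proposal.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Table:
    rows: list = field(default_factory=list)
    snapshot_id: int = 0

    def append(self, row: dict) -> None:
        self.rows.append(row)
        self.snapshot_id += 1  # every commit produces a new snapshot


@dataclass
class MaterializedView:
    base: Table
    cached_total: int = 0
    refreshed_at: Optional[int] = None  # base snapshot ID at last refresh

    def is_stale(self) -> bool:
        return self.refreshed_at != self.base.snapshot_id

    def total(self) -> int:
        # Recompute only when the base table has moved to a new snapshot.
        if self.is_stale():
            self.cached_total = sum(r["amount"] for r in self.base.rows)
            self.refreshed_at = self.base.snapshot_id
        return self.cached_total


orders = Table()
orders.append({"amount": 10})
view = MaterializedView(orders)
first_total = view.total()
orders.append({"amount": 5})
stale_after_write = view.is_stale()
second_total = view.total()
print(first_total, stale_after_write, second_total)
```

The sketch recomputes the whole aggregate on refresh; an incremental refresh would need row-level change information from the base table.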

Beyond Features: Iceberg’s Ecosystem Growth

As Iceberg’s capabilities evolve, so does its ecosystem. Highlights to watch in 2025 include:

Support for nanosecond-precision timestamps: Critical for industries like finance and telecoms that demand high-precision data.
Binary deletion vectors: Part of Spec V3, offering scalable and efficient deletion handling for regulatory compliance and GDPR requirements.
Expanded engine compatibility: Iceberg already integrates with Kafka, PostgreSQL (via RisingWave), and query engines like Trino, Databricks, and Snowflake.
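
The idea behind a deletion vector can be shown with a small sketch: instead of rewriting a data file, the writer records which row positions are deleted, and readers skip them. The example uses a Python integer as a bitmap purely for illustration. The `DeletionVector` class is hypothetical, and the V3 spec defines its own on-disk encoding, which this does not attempt to reproduce.

```python
class DeletionVector:
    """Bitmap of deleted row positions for one data file (illustrative only)."""

    def __init__(self) -> None:
        self._bits = 0

    def delete(self, position: int) -> None:
        # Set the bit for this row position; the data file itself is untouched.
        self._bits |= 1 << position

    def is_deleted(self, position: int) -> bool:
        return bool((self._bits >> position) & 1)

    def apply(self, rows: list) -> list:
        # Readers filter out deleted positions at scan time.
        return [row for pos, row in enumerate(rows) if not self.is_deleted(pos)]


data_file = ["alice", "bob", "carol", "dave"]
dv = DeletionVector()
dv.delete(1)
dv.delete(3)
print(dv.apply(data_file))  # ['alice', 'carol']
```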

The One Missing Piece: Lightweight Compaction

Iceberg excels in many areas, but compaction remains a bottleneck, typically relying on resource-intensive Spark jobs.

This limits adoption for smaller teams and SQL/Python-centric users who need simpler, more resource-efficient options. Fortunately, the community recognizes this gap, and momentum is building for a lightweight, engine-agnostic compaction framework.
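
At its core, compaction planning is a bin-packing problem: group many small files into rewrite tasks close to a target size. The sketch below uses a simple first-fit-decreasing heuristic. The `plan_compaction` function, the 128 MB target, and the file sizes are assumptions for illustration and do not describe how any Iceberg engine plans its rewrites.

```python
def plan_compaction(file_sizes_mb: list, target_mb: int) -> list:
    """Group small files into bins of at most target_mb using first-fit decreasing."""
    bins = []  # each bin is a list of file sizes to rewrite into one output file
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already large enough, leave it alone
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([size])
    # A bin with a single file gains nothing from being rewritten.
    return [group for group in bins if len(group) > 1]


small_files = [5, 60, 20, 130, 40, 8, 64]
plan = plan_compaction(small_files, target_mb=128)
print(plan)  # [[64, 60], [40, 20, 8, 5]]
```

Planning like this needs only file metadata, so it can run in a lightweight process; the expensive part is the rewrite itself, which is where an engine-agnostic framework would help.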

The Road Ahead

With innovations like RBAC catalogs, advanced streaming capabilities, materialized views, and new data type support, Apache Iceberg is on track to become the universal table format for modern data engineering.

2024 marked Iceberg’s victory in the format wars. 2025 will be about making it more accessible, versatile, and powerful for users of all sizes, from startups to global enterprises. Whether you’re managing historical data, building real-time pipelines, or exploring cutting-edge lakehouse designs, Iceberg offers something for everyone.
