Apache Iceberg and the Battle for Open Data Control
Upsolver (acquired by Qlik)
Bridges the gap between engineering and data teams by streaming and optimizing operational data in an Iceberg lakehouse.
When it comes to data lakes, catalogs play a crucial role in organizing and managing metadata, enabling efficient data discovery, access, and analysis. However, not all catalogs are created equal.
In this article, we will explore the three distinct types of catalogs, as each serves a specific purpose and caters to different user groups. Then we’ll look at the new, game-changing catalog emerging from the Apache Iceberg lakehouse implementation, and why cloud data vendors are battling for control of it.
1. Technical Data Catalogs
Technical data catalogs are the backbone of data lakes, responsible for collecting and storing metadata from data systems.
This metadata includes essential information such as table schemas, column names, data types, primary key indicators, and the physical locations of the data files.
Technical catalogs provide the necessary structural details that enable query engines to understand the data they're working with and locate the relevant files for processing.
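As a rough illustration of what a technical catalog holds, here is a minimal sketch in Python. The `TableEntry` fields and the `plan_scan` method are invented names for illustration, not any real catalog's API:

```python
from dataclasses import dataclass

# Hypothetical record shape for a technical catalog entry: the structural
# details a query engine needs to understand and locate a table's data.
@dataclass
class TableEntry:
    name: str
    schema: dict        # column name -> data type
    primary_key: list   # primary key column(s)
    data_files: list    # physical locations of the table's files

class TechnicalCatalog:
    def __init__(self):
        self._tables = {}

    def register(self, entry: TableEntry):
        self._tables[entry.name] = entry

    def plan_scan(self, table: str):
        """What a query engine asks the catalog for: schema plus file locations."""
        entry = self._tables[table]
        return entry.schema, entry.data_files

catalog = TechnicalCatalog()
catalog.register(TableEntry(
    name="sales.orders",
    schema={"order_id": "bigint", "amount": "decimal(10,2)"},
    primary_key=["order_id"],
    data_files=["s3://lake/orders/part-0001.parquet"],
))
schema, files = catalog.plan_scan("sales.orders")
```

The key point is the second half of `plan_scan`: unlike the other catalog types discussed below, a technical catalog sits directly on the query path, handing the engine the files it must read.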
In traditional data warehouses and databases, the system defines and maintains this technical metadata.
However, an external catalog is required to manage this information in data lakes and lakehouses.
For many years, the Hive Metastore (HMS) has been the go-to solution for this purpose, serving as a central repository for technical metadata in Hadoop-based architectures.
2. Business Data Catalogs
Business data catalogs are designed to make data more accessible and understandable to less technical users, such as business analysts and decision-makers.
You can think of business data catalogs as Google for your enterprise metadata.
Business data catalogs collect and present metadata in a way that helps users comprehend what each dataset represents and whether it can help answer their business questions.
These catalogs typically build upon the technical metadata by layering taxonomies, tags, annotations, and other business-specific information.
This enriched metadata aims to provide users with a deeper understanding of the available datasets, enabling them to locate the right data sources for their analytical needs.
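A toy sketch of this layering, with the table names, tags, and `search` function all illustrative: business metadata sits on top of technical metadata and powers the "Google for your enterprise metadata" experience.

```python
# Technical metadata (from the systems of record) — structure only.
technical = {
    "sales.orders": {"columns": ["order_id", "amount", "region"]},
}

# Business layer: descriptions and tags added on top of the technical entries.
business_layer = {
    "sales.orders": {
        "description": "All customer orders, one row per order",
        "tags": ["revenue", "finance"],
    },
}

def search(keyword: str) -> list:
    """Return dataset names whose business metadata mentions the keyword."""
    hits = []
    for table, meta in business_layer.items():
        haystack = meta["description"].lower() + " " + " ".join(meta["tags"]).lower()
        if keyword.lower() in haystack:
            hits.append(table)
    return hits

results = search("revenue")  # -> ["sales.orders"]
```

Note that the search runs entirely over the enriched layer: the business catalog helps users *find* the right table, but plays no part in actually reading it.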
Business catalogs operate independently from other data systems within the enterprise, relying on crawling and scraping to gather information.
Data teams must ensure business catalogs remain in sync with downstream systems; otherwise, users could receive out-of-date information.
3. Operational Data Catalogs
Operational data catalogs emerged from the growing importance of data observability – the practice of monitoring, alerting, and troubleshooting data quality and reliability issues.
These catalogs crawl data systems, extract table metadata, and execute test queries against the data to verify various aspects of table health, such as row counts, field uniqueness, and schema drift.
From a monitoring perspective, operational catalogs provide dashboards, alerts, and custom monitors for data quality and reliability.
From a metadata management standpoint, these catalogs serve as repositories containing table metadata enriched with health and quality information, catering primarily to data engineers and data operations teams.
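The health checks mentioned above – row counts, field uniqueness, schema drift – can be sketched as simple functions. These are illustrative stand-ins for the test queries an operational catalog would actually run against live tables:

```python
def row_count_check(rows: list, minimum: int) -> bool:
    """Does the table have at least the expected number of rows?"""
    return len(rows) >= minimum

def uniqueness_check(rows: list, field: str) -> bool:
    """Is the given field unique across all rows?"""
    values = [r[field] for r in rows]
    return len(values) == len(set(values))

def schema_drift_check(expected_columns: list, actual_columns: list):
    """Report columns that appeared or disappeared since the last check."""
    added = set(actual_columns) - set(expected_columns)
    removed = set(expected_columns) - set(actual_columns)
    return added, removed

# Illustrative data: two order rows and a table that grew a "region" column.
rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 12}]
ok_count = row_count_check(rows, minimum=1)
ok_unique = uniqueness_check(rows, "id")
added, removed = schema_drift_check(["id", "amount"], ["id", "amount", "region"])
```

The catalog stores the results alongside the table metadata, which is what lets it serve as the enriched, health-aware repository described above.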
Hang on… there’s a fourth catalog…
In a traditional data lake, tables were divided into metadata stored in the Hive Metastore and data stored in physical files.
However, with the advent of modern data lakehouses and the adoption of open table formats like Apache Iceberg, a significant shift in metadata management has occurred.
A table in an Iceberg lakehouse is divided into state, metadata, and data. The state is maintained in the catalog, the metadata is stored in manifest files, and the data lives in physical files.
This approach offers several advantages.
Moreover, the Iceberg REST catalog serves as a modern replacement for the aging HMS in the data lake stack. Because its frontend is a simple REST API, any application can use the catalog, allowing HMS to finally be replaced.
By storing the state of the table, the Iceberg catalog not only provides a technical catalog implementation but also facilitates capabilities such as transactions, concurrency control, and versioned timelines of table changes – features traditionally associated with databases and warehouses.
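A simplified sketch of how holding the table *state* enables those capabilities: the catalog keeps a pointer to the current metadata file and only swaps it atomically on commit. The class and method names here are our own; real catalogs implement this behind the REST API:

```python
import threading

class CatalogStore:
    """Toy catalog: tracks the current metadata pointer per table."""

    def __init__(self):
        self._pointers = {}   # table -> current metadata file location
        self._history = {}    # table -> versioned timeline of pointers
        self._lock = threading.Lock()

    def commit(self, table: str, expected, new: str) -> bool:
        """Optimistic concurrency control: the commit succeeds only if the
        table's state still matches what the writer based its changes on."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # a conflicting writer won; caller must retry
            self._pointers[table] = new
            self._history.setdefault(table, []).append(new)
            return True

store = CatalogStore()
store.commit("orders", expected=None, new="s3://lake/orders/meta/v1.json")
ok = store.commit("orders", expected="s3://lake/orders/meta/v1.json",
                  new="s3://lake/orders/meta/v2.json")
stale = store.commit("orders", expected="s3://lake/orders/meta/v1.json",
                     new="s3://lake/orders/meta/v3.json")  # fails: stale base
```

The compare-and-swap gives transactions and concurrency control, and the retained history gives the versioned timeline of table changes – which is also precisely what makes this catalog a single point of control.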
The Gatekeeper Dilemma
As the industry enters an era of consolidation, a tug-of-war is brewing between data cloud vendors and catalog providers, each vying for control over these open data formats.
One of the key differences between technical catalogs and their business or operational counterparts lies in their role as gatekeepers.
Technical catalogs not only store metadata about tables but also play a crucial part in query planning and locating the necessary data files for processing.
This gatekeeper behavior is unique to technical catalogs, as business or operational catalogs do not directly impact query execution.
With the introduction of the Iceberg REST catalog reference implementation, both data cloud vendors and catalog providers have recognized the potential to become gatekeepers to data stored in open formats.
This single point of control gives these players an opportunity to influence, and potentially restrict, the capabilities and performance optimizations available to tables managed by their catalogs.
Vendor Strategies
Cloud data warehouse vendors, including Snowflake and Databricks, have already begun implementing strategies to exert control over open data formats.
For instance, Snowflake announced support for two types of Iceberg tables: one using an external catalog (read-only) and another using Snowflake's internally managed catalog (read/write access for Snowflake only).
This approach forces users to choose between read/write access from Snowflake and read-only access from external tools, or vice versa.
Similarly, Databricks' Unity Catalog supports the Delta Lake table format, offering full read/write access and concurrency controls for multiple writers to the same table. Delta Lake works with other catalogs, but with limited functionality.
Databricks wants other engines and users to integrate with its Unity Catalog to benefit from the full set of capabilities, keeping Delta Lake well and truly coupled.
Modern lakehouse query engines such as Starburst and Dremio offer managed Iceberg tables when customers choose their version of the Iceberg REST catalog, thereby limiting full functionality to tables managed by their catalogs.
Because the technical catalog requirements have been simplified and the dependency on Hadoop and Java removed, anyone can build and host a catalog. Having recognized this crucial point of control within the lakehouse architecture, data platform vendors are grappling for control over tables that should be open.
An Opportunity for Catalog Providers
As the reign of the Hive Metastore comes to an end, catalog vendors, particularly those with technical catalog experience, have an opportunity to embed the Iceberg REST catalog into their platforms.
Catalog solutions can provide features such as unified access controls, data sharing, collaboration, and compliance assurance across thousands of tables created by different tools and services.
By offering a unified experience for technical, business, and operational needs, these providers can reduce the control and influence exerted by data cloud vendors, minimizing the risk of vendor lock-in.
The Road Ahead
While it's challenging to predict the future with certainty, some trends are emerging.
Preserving an open, decoupled lakehouse architecture with an unbiased catalog is crucial for collaboration and for driving faster analytics and AI/ML use cases.
As organizations navigate this evolving architecture, they should prioritize solutions that uphold the principles of openness and vendor neutrality, ensuring that open data remains truly open.