Apache Iceberg and the Battle for Open Data Control
Upsolver (acquired by Qlik)
Bridges the gap between engineering and data teams by streaming and optimizing operational data in an Iceberg lakehouse.
When it comes to data lakes, catalogs play a crucial role in organizing and managing metadata, enabling efficient data discovery, access, and analysis. However, not all catalogs are created equal.
In this article, we will explore the three distinct types of catalogs, as each serves a specific purpose and caters to different user groups. Then we’ll look at the new, game-changing catalog emerging from the Apache Iceberg lakehouse implementation, and why cloud data vendors are battling for control of it.
1. Technical Data Catalogs
Technical data catalogs are the backbone of data lakes, responsible for collecting and storing metadata from data systems.
This metadata includes essential information such as table schemas, column names, data types, primary key indicators, and the physical locations of the data files.
Technical catalogs provide the necessary structural details that enable query engines to understand the data they're working with and locate the relevant files for processing.
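As a rough illustration of what a technical catalog holds, here is a minimal sketch in Python. The `TableEntry` fields and the `plan_scan` method are invented names for illustration, not any real catalog's API:

```python
from dataclasses import dataclass

# Hypothetical record shape for a technical catalog entry: the structural
# details a query engine needs to understand and locate a table's data.
@dataclass
class TableEntry:
    name: str
    schema: dict        # column name -> data type
    primary_key: list   # primary key column(s)
    data_files: list    # physical locations of the table's files

class TechnicalCatalog:
    def __init__(self):
        self._tables = {}

    def register(self, entry: TableEntry):
        self._tables[entry.name] = entry

    def plan_scan(self, table: str):
        """What a query engine asks the catalog for: schema plus file locations."""
        entry = self._tables[table]
        return entry.schema, entry.data_files

catalog = TechnicalCatalog()
catalog.register(TableEntry(
    name="sales.orders",
    schema={"order_id": "bigint", "amount": "decimal(10,2)"},
    primary_key=["order_id"],
    data_files=["s3://lake/orders/part-0001.parquet"],
))
schema, files = catalog.plan_scan("sales.orders")
```

The key point is the second half of `plan_scan`: unlike the other catalog types discussed below, a technical catalog sits directly on the query path, handing the engine the files it must read.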
In traditional data warehouses and databases, the system defines and maintains this technical metadata.
However, an external catalog is required to manage this information in data lakes and lakehouses.
For many years, the Hive Metastore (HMS) has been the go-to solution for this purpose, serving as a central repository for technical metadata in Hadoop-based architectures.
2. Business Data Catalogs
Business data catalogs are designed to make data more accessible and understandable to less technical users, such as business analysts and decision-makers.
You can think of business data catalogs as Google for your enterprise metadata.
Business data catalogs collect and present metadata in a way that helps users comprehend what each dataset represents and whether it can help answer their business questions.
These catalogs typically build upon the technical metadata by layering taxonomies, tags, annotations, and other business-specific information.
This enriched metadata aims to provide users with a deeper understanding of the available datasets, enabling them to locate the right data sources for their analytical needs.
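A toy sketch of this layering, with the table names, tags, and `search` function all illustrative: business metadata sits on top of technical metadata and powers the "Google for your enterprise metadata" experience.

```python
# Technical metadata (from the systems of record) — structure only.
technical = {
    "sales.orders": {"columns": ["order_id", "amount", "region"]},
}

# Business layer: descriptions and tags added on top of the technical entries.
business_layer = {
    "sales.orders": {
        "description": "All customer orders, one row per order",
        "tags": ["revenue", "finance"],
    },
}

def search(keyword: str) -> list:
    """Return dataset names whose business metadata mentions the keyword."""
    hits = []
    for table, meta in business_layer.items():
        haystack = meta["description"].lower() + " " + " ".join(meta["tags"]).lower()
        if keyword.lower() in haystack:
            hits.append(table)
    return hits

results = search("revenue")  # -> ["sales.orders"]
```

Note that the search runs entirely over the enriched layer: the business catalog helps users *find* the right table, but plays no part in actually reading it.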
Business catalogs operate independently from other data systems within the enterprise, relying on crawling and scraping to gather information.
Data teams must ensure business catalogs remain in sync with downstream systems; otherwise, users could receive out-of-date information.
3. Operational Data Catalogs
Operational data catalogs emerged from the growing importance of data observability – the practice of monitoring, alerting, and troubleshooting data quality and reliability issues.
These catalogs crawl data systems, extract table metadata, and execute test queries against the data to verify various aspects of table health, such as row counts, field uniqueness, and schema drift.
From a monitoring perspective, operational catalogs provide dashboards, alerts, and custom monitors for data quality and reliability.
From a metadata management standpoint, these catalogs serve as repositories containing table metadata enriched with health and quality information, catering primarily to data engineers and data operations teams.
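The health checks mentioned above – row counts, field uniqueness, schema drift – can be sketched as simple functions. These are illustrative stand-ins for the test queries an operational catalog would actually run against live tables:

```python
def row_count_check(rows: list, minimum: int) -> bool:
    """Does the table have at least the expected number of rows?"""
    return len(rows) >= minimum

def uniqueness_check(rows: list, field: str) -> bool:
    """Is the given field unique across all rows?"""
    values = [r[field] for r in rows]
    return len(values) == len(set(values))

def schema_drift_check(expected_columns: list, actual_columns: list):
    """Report columns that appeared or disappeared since the last check."""
    added = set(actual_columns) - set(expected_columns)
    removed = set(expected_columns) - set(actual_columns)
    return added, removed

# Illustrative data: two order rows and a table that grew a "region" column.
rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 12}]
ok_count = row_count_check(rows, minimum=1)
ok_unique = uniqueness_check(rows, "id")
added, removed = schema_drift_check(["id", "amount"], ["id", "amount", "region"])
```

The catalog stores the results alongside the table metadata, which is what lets it serve as the enriched, health-aware repository described above.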
Hang on… there’s a fourth catalog…
In a traditional data lake, tables were divided into metadata stored in the Hive Metastore and data stored in physical files.
However, with the advent of modern data lakehouses and the adoption of open table formats like Apache Iceberg, a significant shift in metadata management has occurred.
A table in an Iceberg lakehouse is divided into state, metadata, and data. The state is maintained in the catalog, the metadata is stored in manifest files, and the data lives in physical files.
This approach offers several advantages.
Moreover, the Iceberg REST catalog serves as a modern replacement for the aging HMS in the data lake stack. Because its frontend is a simple REST API, any application can use the catalog, allowing HMS to finally be replaced.
By storing the state of the table, the Iceberg catalog not only provides a technical catalog implementation but also facilitates capabilities such as transactions, concurrency control, and versioned timelines of table changes – features traditionally associated with databases and warehouses.
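A simplified sketch of how holding the table *state* enables those capabilities: the catalog keeps a pointer to the current metadata file and only swaps it atomically on commit. The class and method names here are our own; real catalogs implement this behind the REST API:

```python
import threading

class CatalogStore:
    """Toy catalog: tracks the current metadata pointer per table."""

    def __init__(self):
        self._pointers = {}   # table -> current metadata file location
        self._history = {}    # table -> versioned timeline of pointers
        self._lock = threading.Lock()

    def commit(self, table: str, expected, new: str) -> bool:
        """Optimistic concurrency control: the commit succeeds only if the
        table's state still matches what the writer based its changes on."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # a conflicting writer won; caller must retry
            self._pointers[table] = new
            self._history.setdefault(table, []).append(new)
            return True

store = CatalogStore()
store.commit("orders", expected=None, new="s3://lake/orders/meta/v1.json")
ok = store.commit("orders", expected="s3://lake/orders/meta/v1.json",
                  new="s3://lake/orders/meta/v2.json")
stale = store.commit("orders", expected="s3://lake/orders/meta/v1.json",
                     new="s3://lake/orders/meta/v3.json")  # fails: stale base
```

The compare-and-swap gives transactions and concurrency control, and the retained history gives the versioned timeline of table changes – which is also precisely what makes this catalog a single point of control.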
The Gatekeeper Dilemma
As the industry enters an era of consolidation, a tug-of-war is brewing between data cloud vendors and catalog providers, each vying for control over these open data formats.
One of the key differences between technical catalogs and their business or operational counterparts lies in their role as gatekeepers.
Technical catalogs not only store metadata about tables but also play a crucial part in query planning and locating the necessary data files for processing.
This gatekeeper behavior is unique to technical catalogs, as business or operational catalogs do not directly impact query execution.
With the introduction of the Iceberg REST catalog reference implementation, both data cloud vendors and catalog providers have recognized the potential to become gatekeepers to data stored in open formats.
This single point of control gives these players an opportunity to influence, and potentially restrict, the capabilities and performance optimizations available to tables managed by their catalogs.
Vendor Strategies
Cloud data warehouse vendors, including Snowflake and Databricks, have already begun implementing strategies to exert control over open data formats.
For instance, Snowflake announced support for two types of Iceberg tables: one using an external catalog (read-only) and another using Snowflake's internally managed catalog (read/write access for Snowflake only).
This approach forces users to choose between read/write access from Snowflake and read-only access from external tools, or vice versa.
Similarly, Databricks' Unity Catalog supports the Delta Lake table format, offering full read/write access and concurrency controls for multiple writers to the same table. Delta Lake works with other catalogs, but with limited functionality.
Databricks wants other engines and users to integrate with its Unity Catalog to benefit from the full set of capabilities, keeping Delta Lake well and truly coupled.
Modern lakehouse query engines such as Starburst and Dremio offer managed Iceberg tables when customers choose their version of the Iceberg REST catalog, thereby limiting full functionality to tables managed by their catalogs.
Because the technical catalog requirements have been simplified and the dependency on Hadoop and Java removed, anyone can build and host a catalog. Having recognized this crucial point of control within the lakehouse architecture, data platform vendors are grappling for control over tables that should be open.
An Opportunity for Catalog Providers
As the reign of the Hive Metastore comes to an end, catalog vendors, particularly those with technical catalog experience, have an opportunity to embed the Iceberg REST catalog into their platforms.
Catalog solutions can provide features such as unified access controls, data sharing, collaboration, and compliance assurance across thousands of tables created by different tools and services.
By offering a unified experience for technical, business, and operational needs, these providers can reduce the control and influence exerted by data cloud vendors, minimizing the risk of vendor lock-in.
The Road Ahead
While it's challenging to predict the future with certainty, some trends are emerging.
Preserving an open, decoupled lakehouse architecture with an unbiased catalog is crucial for collaboration and for driving faster analytics and AI/ML use cases.
As organizations navigate this evolving architecture, they should prioritize solutions that uphold the principles of openness and vendor neutrality, ensuring that open data remains truly open.