Modern Data Catalogs and Semantic Layers

Organizations now hold so much data that finding, governing, and understanding it has become a discipline of its own. This article clarifies the terminology around modern data catalogs: it distinguishes technical from business data catalogs, explains the role of semantic layers, and looks at what platforms like Snowflake offer, particularly its Polaris and Horizon catalogs.

Understanding the Landscape of Data Catalogs

Before diving into specifics, let’s align on terminology. The term “Data Catalog” is often used inconsistently across the industry. To clarify, we’ll distinguish different types of data catalogs. Later, we’ll also introduce the role of semantic layers.

A key question in the context of Snowflake arises: “Why does Snowflake have two data catalogs—Polaris and Horizon?”

The answer lies in how the industry uses one term for two (or even three) distinct types of services. Some of them perform catalog-like functions but are technically different, and one more, the semantic layer, is not a data catalog at all but serves a similar purpose.

The Four Key Layers of Data Catalogs

To better understand how data catalogs function, we can broadly distinguish the following levels:

1. Data Definition and Manipulation

This layer manages data storage, enabling transactional capabilities, data and schema versioning, and other storage-related tasks. It is typically included in the database engine and translates data definition statements into lower-level operations on the metadata (such as INFORMATION_SCHEMA) and on the data itself, whether it lives in the database or in a data lakehouse.

2. Data Access Control

This layer governs access at various levels. It may include simple privilege granting, role-based access control (RBAC), tag-based security policies, row- and column-level security, and aggregation policies. This level is typically integrated into database engines and Data Fabrics.
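
To make these controls concrete, here is a minimal, engine-agnostic sketch of how a privilege check, column masking, and row filtering can compose on read. The policy model is an assumption made for illustration: ROLE_GRANTS, MASKED_COLUMNS, ROW_FILTERS, and read_table are hypothetical names, not any vendor's API.

```python
from typing import Callable

# Hypothetical policy model; names are illustrative, not a vendor API.
ROLE_GRANTS = {"analyst": {"sales"}, "admin": {"sales", "hr"}}  # RBAC: role -> readable tables
MASKED_COLUMNS = {"analyst": {"email"}}                         # column-level security
ROW_FILTERS: dict[str, Callable[[dict], bool]] = {              # row-level security
    "analyst": lambda row: row["region"] == "EU",
}


def read_table(role: str, table: str, rows: list[dict]) -> list[dict]:
    """Return the rows a role may see, with masked columns redacted."""
    if table not in ROLE_GRANTS.get(role, set()):
        raise PermissionError(f"role {role!r} has no SELECT privilege on {table!r}")
    allowed = ROW_FILTERS.get(role, lambda row: True)
    masked = MASKED_COLUMNS.get(role, set())
    return [
        {col: ("***" if col in masked else val) for col, val in row.items()}
        for row in rows
        if allowed(row)
    ]


SALES = [
    {"region": "EU", "email": "a@example.com", "amount": 120},
    {"region": "US", "email": "b@example.com", "amount": 80},
]

analyst_view = read_table("analyst", "sales", SALES)  # one EU row, email masked
admin_view = read_table("admin", "sales", SALES)      # both rows, nothing masked
```

In a real engine these policies live next to the data and are enforced by the query planner, which is why this level is usually built into the database or data fabric rather than bolted on.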

3. Data Asset Management

This encompasses a wide range of tasks:

  • Describes data and provides its characteristics.
  • Facilitates data discovery.
  • Standardizes data terminology through a ubiquitous language.
  • Provides visibility into data lineage from origin to derivative assets.
  • Manages integration with the preceding two levels.
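
Lineage visibility is easiest to picture as a graph walk. The sketch below assumes lineage is stored as a simple mapping from each asset to its direct downstream assets; the asset names and the downstream_assets helper are hypothetical and only illustrate the idea of impact analysis.

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> direct downstream assets.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.daily_sales", "mart.customer_ltv"],
    "mart.daily_sales": ["dashboard.revenue"],
    "mart.customer_ltv": [],
    "dashboard.revenue": [],
}


def downstream_assets(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every asset derived from `asset`."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(lineage.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(lineage.get(current, []))
    return order


# Everything that could be affected by a change to raw.orders.
impacted = downstream_assets("raw.orders", LINEAGE)
```

Running the same walk on reversed edges would answer the opposite question, namely where a derived asset originates.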

4. Semantic Layer

Enriches standard data catalogs with contextual business information, including metrics, descriptions, dimensions, and hierarchical relationships. This layer enhances analytical capabilities by embedding business logic directly into the data API.

From the user’s perspective, it can be described as follows:


A technical data catalog (sometimes called a metastore, although that name is not quite accurate for modern catalogs) is closely linked to the data itself, storing most of its metadata alongside the data. A business data catalog, on the other hand, is usually a separate data governance layer that integrates with it. Finally, the semantic layer can be part of the analytical engine or a separate service, and it can extend the technical data catalog with some capabilities of a business data catalog.

The significant difference between the business data catalog and the semantic layer is that the latter should be part of the data API, enabling seamless integration without requiring external services.

Polaris: Data Definition, Manipulation, and Access Control Service

Polaris falls under the first two categories, acting as a technical catalog for your data. Every data management system has its catalog, ranging from simple file lists to complex systems like Polaris, which tracks current and historical data partitions and changes.
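
To illustrate what "tracking current and historical data partitions and changes" means for a technical catalog, here is a toy sketch. It is not the Polaris or Iceberg API: the Snapshot and TableEntry classes are hypothetical, and the only idea they borrow is that every commit records a new immutable table version instead of overwriting the previous one.

```python
from dataclasses import dataclass, field
from typing import AbstractSet


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset  # data files that make up this table version


@dataclass
class TableEntry:
    name: str
    snapshots: list = field(default_factory=list)

    def commit(self, added: AbstractSet[str] = frozenset(),
               removed: AbstractSet[str] = frozenset()) -> Snapshot:
        """Record a new table version; earlier versions stay untouched."""
        current = self.snapshots[-1].files if self.snapshots else frozenset()
        snapshot = Snapshot(len(self.snapshots) + 1, frozenset((current | added) - removed))
        self.snapshots.append(snapshot)
        return snapshot

    def files_as_of(self, snapshot_id: int) -> frozenset:
        """Time travel: list the files that made up an earlier version."""
        if not 1 <= snapshot_id <= len(self.snapshots):
            raise KeyError(f"unknown snapshot {snapshot_id}")
        return self.snapshots[snapshot_id - 1].files


orders = TableEntry("sales.orders")
orders.commit(added={"part-001.parquet"})
orders.commit(added={"part-002.parquet"})
orders.commit(added={"part-003.parquet"}, removed={"part-001.parquet"})

current_files = orders.files_as_of(3)   # part-002 and part-003
original_files = orders.files_as_of(1)  # part-001 only
```

Because the catalog only hands out lists of files per version, any engine that understands the table format can read the same data, which is the property the independence argument relies on.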

A unique feature of Polaris is its ability to delegate data manipulation to other catalogs. For instance, you can manage data across AWS Glue or Azure OneLake from a single place without relying on external services. Unlike traditional built-in catalogs, Polaris operates independently, allowing users to choose their preferred computing resources, such as Snowflake or Databricks. This flexibility helps reduce overhead and lets teams optimize computing and storage costs.

Polaris isn’t the only option. Unity, Glue Data Catalog, and Hive also support Iceberg tables, offering alternatives with varying strengths. The list of services is not final and will continue to grow. Snowflake claims that Polaris’s strength is its efficiency, enabling cost reductions in computing and storage when using Snowflake. Further details can be found in this article on building an open data lakehouse and this piece on unifying Iceberg tables.

A key advantage of Polaris, similar to Unity, is its open-source nature and relative independence. This makes it easier to establish a unified interoperability layer for a full-blown Data Mesh architecture or its components with minimal effort. “Open-source” might not be the most precise term in this context; “independent” is more accurate. However, this flexibility is limited to data processors that support Iceberg. Looking ahead, choosing non-compatible solutions is becoming increasingly unlikely. Although ETL processes will not be entirely eliminated, we can significantly reduce data transfer and additional storage costs.


Data Catalogs for Asset Management

Business data catalogs focus on cataloging and exploration. They manage information that may include technical metadata (e.g., tables, files, lineage diagrams, and technical data quality), but they concentrate on treating data as assets. Objects in a data catalog need meaningful names and business descriptions to serve their purpose, so integrating a business glossary linked to technical metadata is an excellent way to make data more comprehensible to its users.
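
As a rough sketch of what "a business glossary linked to technical metadata" can look like, the snippet below attaches glossary terms to technical columns and lets users search by business name. The GlossaryTerm and ColumnAsset classes, the column names, and the sample definition of "Net Sales" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    name: str
    definition: str


@dataclass
class ColumnAsset:
    fqn: str  # fully qualified technical name
    terms: list = field(default_factory=list)

    def describe(self) -> str:
        """Render the technical name together with its business meaning."""
        if not self.terms:
            return f"{self.fqn}: (no business description yet)"
        labels = "; ".join(f"{t.name}: {t.definition}" for t in self.terms)
        return f"{self.fqn} -> {labels}"


def find_by_term(assets: list, term_name: str) -> list:
    """Discovery by business vocabulary instead of table or column names."""
    wanted = term_name.lower()
    return [a.fqn for a in assets if any(t.name.lower() == wanted for t in a.terms)]


# Illustrative definition; a real glossary entry is owned by the business.
net_sales = GlossaryTerm("Net Sales", "Gross sales minus discounts and returns")
assets = [
    ColumnAsset("mart.daily_sales.net_amt", [net_sales]),
    ColumnAsset("mart.daily_sales.load_ts"),
]

matches = find_by_term(assets, "net sales")
summary = assets[0].describe()
```

The cryptic column net_amt becomes discoverable through the term people actually use, while columns without a term show up as documentation gaps.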

Typically, a business data catalog is either a standalone service or a proprietary built-in one, with specific components such as a search engine, a GraphQL API, a well-developed attribute set, and classification capabilities.


The purpose often determines usage. For instance, a Data Asset Management service that is tightly coupled with a specific data platform, such as Azure Purview or Snowflake Horizon, is usually limited in its ability to serve broader business needs, because assets are rarely confined to a single solution. AWS Glue, for its part, falls short as a comprehensive Data Asset Management service.

The decision often boils down to vendor lock-in versus better integration. It might seem straightforward, but the trade-off isn’t always obvious. Balancing both worlds is often the best approach.

Let’s compare platform-specific data catalogs to independent ones and analyze their respective advantages. I’ll use Snowflake Horizon and Collate (OpenMetadata) as examples.

Snowflake Horizon

Initially introduced as a Snowflake-specific data catalog, this solution offers deep integration with Snowflake, enabling rapid onboarding of new features. Recently, support for the Polaris catalog was added, allowing it to be indexed and managed. When using Horizon with Polaris, features such as tags, column-level security, and other integrations become available.

A key advantage is the bidirectional connection: there are no boundaries between the catalog and the platform, so changes in the catalog are immediately reflected in platform behavior. Horizon is simple, user-friendly, cost-effective, and easy to start with, letting you get the most out of the service.

However, incorporating additional sources with Horizon is challenging:

  • If your data team uses tools like dbt (for data modeling), Airflow (for orchestration), AWS Glue, or OneLake, relying solely on Horizon can complicate workflows.
  • You may require multiple data catalogs, which could become redundant.

Alternatively, using an independent service could be more effective. Such a service:

  • Acts as a central repository for business metadata.
  • Supports data exploration and impact analysis.
  • Stores a business glossary.
  • Consolidates data observability and quality metrics from various sources.

Vendor neutrality (or service agnosticism) is crucial for a central data catalog, ensuring it functions as a universal data language.

Collate (OpenMetadata) offers an alternative. Originating at Uber, it evolved into an open-source solution, blending community-driven and proprietary features. This dual nature provides flexibility between enterprise support and startup-style innovation.

Key features include:

1. Data Exploration and Discovery

  • Catalog browser.
  • Search for data objects (databases, tables, schemas, dashboards, ML models, unstructured storage, etc.), glossaries, tags, and more with advanced filters. While search prioritization is limited, improvements are planned.
  • Schema visibility for better understanding of objects.
  • Data lineage at the table and column level, supporting tools like Tableau, Power BI, and dbt.
  • Built-in data sampling and profiling for deeper data insights.
  • Entity-relationship diagrams (ERDs), with current capabilities focused on foreign-primary key relationships. Plans exist to expand this feature.

2. Data Quality

  • Collate integrates with tools like Soda, Great Expectations, and dbt, serving as a control center for data quality. You can build dashboards, send alerts, manage incidents, and assess data quality coverage within one tool.
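
The "control center" idea boils down to normalizing results from several tools into one view. The sketch below assumes the results have already been converted to a common record shape; that shape, the tool_a/tool_b labels, and the summarize helper are hypothetical and do not reflect the output format of Soda, Great Expectations, dbt, or Collate.

```python
from collections import defaultdict

# Hypothetical, already-normalized test results; real tools emit their own formats.
RESULTS = [
    {"source": "tool_a", "table": "orders", "test": "not_null(id)", "passed": True},
    {"source": "tool_b", "table": "orders", "test": "row_count > 0", "passed": False},
    {"source": "tool_a", "table": "customers", "test": "unique(email)", "passed": True},
]
CATALOGED_TABLES = ["orders", "customers", "payments"]


def summarize(results: list, tables: list) -> dict:
    """Combine results from all sources into failing tables and test coverage."""
    by_table = defaultdict(list)
    for result in results:
        by_table[result["table"]].append(result)
    failing = sorted(
        table for table, rs in by_table.items() if any(not r["passed"] for r in rs)
    )
    coverage = sum(1 for table in tables if table in by_table) / len(tables)
    return {"failing_tables": failing, "coverage": coverage}


report = summarize(RESULTS, CATALOGED_TABLES)
```

Here one table fails a check and one cataloged table has no tests at all, which is exactly the kind of coverage gap that stays invisible when each tool reports separately.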

3. Platform Integration

  • Extensive API support for data ingestion.
  • Broad compatibility with various systems—details are available on the official Collate site.

4. AI-Driven Features

  • Automated documentation of data assets.
  • Natural language SQL query generation and optimization.
  • Automated data quality testing and outlier detection.

While Horizon provides overlapping features such as ERDs and glossary capabilities, it excels in Snowflake-specific functionalities like compute resource management and role control. Its AI capabilities are broader in areas like data classification, quality control, and access history.

Choosing between these solutions involves trade-offs, but picking just one isn’t necessary. Adding a dedicated layer for managing data assets is feasible. Horizon, in particular, provides robust compliance, privacy, and security features, making it a solid choice for a security layer. While maintaining dual systems requires additional effort for integration and user training, this approach can be implemented organically, especially since Horizon is built into the Snowflake ecosystem.

Planning and Function Mapping

Effective planning is essential when managing multiple tools. Below is a feature comparison to guide decision-making:


This list is not exhaustive and can be expanded based on specific needs. Certain features may transition to different levels as new data platform components or features are adopted.
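
One lightweight way to keep such a mapping reviewable is to store it as data and check it for gaps and overlaps. The sketch below is illustrative only: the capabilities are taken from the four levels described earlier, while the tool assignments are invented for a hypothetical team and are not the comparison referred to above.

```python
# Levels follow the four layers described earlier in the article.
CAPABILITY_LEVEL = {
    "schema versioning": 1,
    "role-based access control": 2,
    "row- and column-level security": 2,
    "data lineage": 3,
    "business glossary": 3,
    "metric definitions": 4,
}

# Illustrative assumption: which kind of tool a hypothetical team assigns to each capability.
ASSIGNMENTS = {
    "schema versioning": ["technical catalog"],
    "role-based access control": ["technical catalog", "platform catalog"],
    "data lineage": ["independent business catalog"],
    "business glossary": ["independent business catalog", "platform catalog"],
    "metric definitions": [],
}


def review(capability_level: dict, assignments: dict) -> tuple:
    """Return capabilities nobody owns and capabilities owned by several tools."""
    gaps = sorted(c for c in capability_level if not assignments.get(c))
    overlaps = sorted(c for c, tools in assignments.items() if len(tools) > 1)
    return gaps, overlaps


gaps, overlaps = review(CAPABILITY_LEVEL, ASSIGNMENTS)
```

Gaps point to functions that still need an owner, and overlaps point to places where two tools must be kept in sync or one must be declared the source of truth.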

Semantic Layer and Its Role in Data Governance

The semantic layer is a class of service components that can complement or partially overlap with data catalogs' functions. It serves as an enhanced data catalog that focuses on data explanation and accessibility.

A simple semantic layer could be as basic as a data mart with defined foreign key relationships between fact and dimension tables. Acting as an intermediary layer, it describes the data and provides additional metadata. Historically, this layer was embedded within BI systems, but as organizations mature in their data governance practices, it is evolving into an independent component of the data stack.

Key features of a semantic layer include:

1. Metric Definitions

  • Centralized metrics management ensures consistency across reports and dashboards.
  • Defines how metrics are calculated and aggregated, avoiding discrepancies caused by inconsistent logic in different tools.
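
A minimal sketch of centralized metric management, assuming a hypothetical Metric registry and an orders table with amount and discount columns: the calculation is written once, and every consumer asks the registry for SQL instead of re-implementing the formula. The example uses SQLite from the Python standard library only to show that the generated queries run.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str  # aggregation logic, defined exactly once
    table: str


# Hypothetical registry; the formula for net_sales is an illustrative assumption.
METRICS = {"net_sales": Metric("net_sales", "SUM(amount - discount)", "orders")}


def compile_metric(metric_name: str, group_by: list) -> str:
    """Turn a metric name plus dimensions into a SQL query."""
    metric = METRICS[metric_name]
    dims = ", ".join(group_by)
    select = f"{dims}, " if dims else ""
    sql = f"SELECT {select}{metric.expression} AS {metric.name} FROM {metric.table}"
    return f"{sql} GROUP BY {dims}" if dims else sql


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL, discount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("EU", 100, 10), ("EU", 50, 0), ("US", 80, 5)],
)

by_region = conn.execute(compile_metric("net_sales", ["region"]) + " ORDER BY region").fetchall()
total = conn.execute(compile_metric("net_sales", [])).fetchone()[0]
```

Both the regional breakdown and the grand total come from the same expression, so a dashboard and an ad hoc report cannot silently disagree on what net sales means.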

2. Dimension Hierarchies and Relationships

  • Outlines relationships between dimensions and facts, enabling hierarchical data exploration. For example, it supports structures like time-based hierarchies (Year → Quarter → Month) or geographic hierarchies (Region → Country → City).
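
Hierarchical exploration can be sketched as aggregating the same facts at different depths of a declared hierarchy. The HIERARCHY list, the sample facts, and the roll_up helper below are hypothetical and only illustrate drilling from year to quarter to month.

```python
from collections import defaultdict

HIERARCHY = ["year", "quarter", "month"]  # coarse to fine, as in Year → Quarter → Month

# Hypothetical fact rows.
FACTS = [
    {"year": 2024, "quarter": "Q1", "month": "Jan", "sales": 100},
    {"year": 2024, "quarter": "Q1", "month": "Feb", "sales": 150},
    {"year": 2024, "quarter": "Q2", "month": "Apr", "sales": 200},
]


def roll_up(facts: list, level: str, measure: str = "sales") -> dict:
    """Aggregate the measure at a hierarchy level, keeping parent levels in the key."""
    depth = HIERARCHY.index(level) + 1
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[name] for name in HIERARCHY[:depth])
        totals[key] += row[measure]
    return dict(totals)


by_year = roll_up(FACTS, "year")
by_quarter = roll_up(FACTS, "quarter")
```

Because the hierarchy is declared once, drilling down is just a matter of asking for a deeper level; the user never has to know which columns to group by.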

3. Business Glossary and Terminology

  • Similar to a data catalog, the semantic layer includes a glossary that defines business terms, providing users with context for metrics and dimensions. However, this information is presented closer to the data, often within reporting tools.

4. Proximity to Data

  • Unlike standalone data catalogs, the semantic layer is embedded within or directly integrated with client-facing tools, such as BI systems or data visualization platforms. This proximity reduces the need for users to reference external catalogs, as data explanations and definitions are readily accessible.

5. Improved Data Governance

  • Ensures consistent application of data governance rules, including access controls, privacy policies, and metric definitions.

6. Enhanced User Experience

By integrating directly with data and visualization tools, the semantic layer empowers non-technical users to interact with data more intuitively. For example:

  • Natural language querying (e.g., “Show sales by region for the last quarter”)
  • Metric explanations embedded in dashboards
  • Role-based access, ensuring users see only the data they are authorized to view

Future Impact of the Semantic Layer

As the semantic layer evolves, it has the potential to reshape the data governance landscape significantly in several ways:

  • Reducing the Need for ETL: By enabling dynamic transformations and aggregations at query time, semantic layers may simplify or even replace certain ETL processes.
  • Decentralization in Data Mesh: In a data mesh architecture, the semantic layer could act as a shared layer of understanding across federated domains, ensuring interoperability and consistency without relying on a centralized monolith.
  • Bridging Technical and Business Users: The semantic layer provides a shared language for both technical and business teams, fostering collaboration and reducing miscommunication.
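
The first point, aggregation at query time instead of in a batch job, can be demonstrated with SQLite from the Python standard library. This is a simplified sketch with hypothetical table and view names; a plain SQL view stands in for a semantic-layer definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount REAL);
    INSERT INTO orders VALUES ('EU', 100), ('US', 80);

    -- ETL-style: a physical aggregate table, populated by a batch job.
    CREATE TABLE sales_by_region_etl AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

    -- Semantic-layer-style: the aggregation is a definition, evaluated at query time.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")

# A new order arrives after the batch job has run.
conn.execute("INSERT INTO orders VALUES ('EU', 50)")

stale = dict(conn.execute("SELECT region, total FROM sales_by_region_etl"))
fresh = dict(conn.execute("SELECT region, total FROM sales_by_region"))
```

The batch table still reports 100 for the EU until the job reruns, while the view already reports 150. The trade-off is that the view recomputes on every query, which is why only certain ETL steps are good candidates for replacement.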

Integration with Data Catalogs

While the semantic layer and data catalogs share overlapping functionalities, they serve different purposes and can complement each other effectively:

  • Data Catalogs focus on metadata management, exploration, and governance. They provide an overarching view of data assets and ensure data quality, compliance, and observability.
  • Semantic Layers enhance usability by enabling intuitive interaction with data. They improve data accessibility within specific tools or applications.

By combining these tools, organizations can balance governance with accessibility. For instance, while the data catalog serves as a central repository for business and technical metadata, the semantic layer ensures this metadata is applied seamlessly to downstream tools.

Challenges and Considerations

  • Vendor Lock-in: Many semantic layers are tightly integrated with specific platforms or BI tools, limiting their flexibility.
  • Scalability: As data volumes and complexity grow, maintaining consistent definitions across the semantic layer can become challenging.
  • Integration Effort: Organizations must carefully plan how to integrate semantic layers with existing tools and workflows, ensuring compatibility with data catalogs, governance frameworks, and data platforms.

In conclusion, the semantic layer represents a significant evolution in data governance, bridging usability, accessibility, and governance. By integrating semantic layers with data catalogs, organizations can develop a holistic approach to managing and maximizing the value of their data assets.

Examples of Semantic Layers

1. Looker’s LookML

Description: LookML is Looker’s proprietary modeling language that enables users to define metrics, relationships, and dimensions for use in Looker dashboards and reports.

Key Features: Centralized metric definitions for consistent reporting. Ability to define joins, hierarchies, and custom calculations. Integration with Looker’s visualization engine for seamless exploration.

Strengths: Deep integration with Looker and minimal overhead for users accessing data.

Limitations: Proprietary to Looker, creating challenges in multi-tool environments.

2. Tableau Semantic Layer

Description: Tableau’s semantic layer allows users to define relationships, hierarchies, and aggregations within Tableau dashboards.

Key Features: Support for calculated fields and custom hierarchies. Data-blending capabilities to combine data from multiple sources. Integration with Tableau Prep for data transformations.

Strengths: User-friendly and accessible for business users.

Limitations: Lacks broader governance capabilities found in standalone catalogs.

3. Microsoft Power BI Data Models

Description: Power BI includes a semantic modeling layer that allows users to define relationships between tables, create measures, and establish hierarchies.

Key Features: Tight integration with the Microsoft ecosystem (e.g., Azure, Excel). DAX (Data Analysis Expressions) for advanced metric creation. Role-based security at the data model level.

Strengths: Familiar interface for Microsoft users and excellent for self-service analytics.

Limitations: Best suited for Microsoft-heavy environments.

4. dbt Metrics

Description: dbt introduced metrics as part of its semantic layer, enabling teams to define reusable metric logic at the transformation layer.

Key Features: Metric definitions tied directly to transformation logic. Integration with tools like Snowflake and Looker for downstream use. Strong community support due to dbt’s open-source nature.

Strengths: Open and extensible, making it a good choice for modern data stacks.

Limitations: Requires technical expertise for setup and maintenance.

5. Cube

Description: A semantic layer that is agnostic to the BI tool and the data warehouse (DWH) and integrates with multiple BI platforms.

Key Features: Integration with multiple data warehouses and BI tools. Available in both open-source and proprietary editions.

Strengths: Universal; can be used with multiple BI and AI tools.

Limitations: Overkill for a homogeneous DWH environment with a single vendor.

6. Snowflake (Cortex) semantic model

Description: Snowflake’s proprietary semantic layer.

Key Features: Definitions of metrics, dimensions, and relationships, plus support for suggested questions and verified queries. Can be generated and stored by Snowflake.

Strengths: Included in Snowflake out of the box and supported by Cortex NLQ and Cortex Search.

Limitations: Designed for AI use cases and not supported by BI tools. Can be used only in Snowflake.

Complementary Roles in Practice

1. Data Exploration and Governance

Scenario: A retail company wants to govern its sales data across multiple teams and ensure consistent reporting.

Solution: Use a data catalog (e.g., Collibra) to manage metadata, ensure compliance, and enable data discovery. Employ a semantic layer (e.g., Looker LookML) to define metrics like “Net Sales” and make them available directly in reports.

2. Lineage and Impact Analysis

Scenario: A financial institution needs to track data lineage for regulatory purposes while empowering analysts with reliable metrics.

Solution: Use a data catalog (e.g., OpenMetadata) to track lineage from source systems to reports. Integrate a semantic layer (e.g., Tableau) to ensure metric definitions are consistent and accessible in dashboards.

3. Hybrid Integration

Scenario: A company using Snowflake for data storage and multiple BI tools for reporting wants centralized governance without sacrificing tool-specific functionality.

Solution: Use Snowflake Horizon as the central data catalog for governance and compliance. Leverage Tableau and Power BI semantic layers for tool-specific data exploration and reporting.

The Best Approach: A Layered Strategy

Rather than choosing between Horizon and an independent catalog, organizations can benefit from using both strategically.

  • Better Governance: Centralized catalogs provide oversight for compliance, while semantic layers ensure governance policies are implemented at the user level.
  • Improved Collaboration: Semantic layers bring governance closer to end-users, fostering collaboration between technical and business teams.
  • Enhanced Scalability: A hybrid approach allows organizations to scale data governance across diverse tools and platforms.

As you can see, there is no one-size-fits-all solution. By combining the strengths of several tools and services, you can achieve what is best for your business now and in the future.

Originally published here.