Modern Data Catalogs and Semantic Layers
Andrew Mazur
Senior Business Development Manager @ DataArt | Driving Technology Transformation
In today's data-driven world, organizations are inundated with vast amounts of information, making effective data management crucial. As we delve into the landscape of modern data catalogs, it's essential to clarify the terminology and understand the various types of catalogs available. This article will explore the distinctions between technical and business data catalogs, the role of semantic layers, and the unique offerings of platforms like Snowflake, particularly its Polaris and Horizon catalogs.
Understanding the Landscape of Data Catalogs
Before diving into specifics, let’s align on terminology. The term “Data Catalog” is often used inconsistently across the industry. To clarify, we’ll distinguish different types of data catalogs. Later, we’ll also introduce the role of semantic layers.
A key question in the context of Snowflake arises: “Why does Snowflake have two data catalogs—Polaris and Horizon?”
The answer is that the industry uses one term for two (or even three) distinct types of services that perform catalog-like functions but are technically different, plus one more that is not a data catalog at all but serves a similar purpose.
The Four Key Layers of Data Catalogs
To better understand how data catalogs function, we can broadly distinguish the following levels:
1. Data Definition and Manipulation
This layer manages data storage, enabling transactional capabilities, data and schema versioning, and other storage-related tasks. It is typically part of the database engine and translates data definition statements into lower-level operations on INFORMATION_SCHEMA and on the data itself, whether it lives in a database or a data lakehouse.
2. Data Access Control
This layer governs access at various levels. It may include simple privilege granting, role-based access control (RBAC), tag-based security policies, row- and column-level security, and aggregation policies. This level is typically integrated into database engines and Data Fabrics.
3. Data Asset Management
This encompasses a wide range of tasks:
4. Semantic Layer
Enriches standard data catalogs with contextual business information, including metrics, descriptions, dimensions, and hierarchical relationships. This layer enhances analytical capabilities by embedding business logic directly into the data APIs.
From the user’s perspective, it can be described as follows:
A technical data catalog (sometimes called a metastore, although that name is not quite accurate for modern catalogs) is closely linked to the data itself and stores most of its metadata alongside the data. A business data catalog, on the other hand, is usually a separate data governance layer that integrates with the technical catalog. Finally, the semantic layer can be part of the analytical engine or a separate service, and it can extend the technical data catalog with some capabilities of a business data catalog.
The significant difference between the business data catalog and the semantic layer is that the latter should be part of the data API, enabling seamless integration without requiring external services.
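The access-control level described above (role checks, row-level and column-level security) can be sketched in a few lines. This is a minimal illustration only: `TablePolicy`, `read_table`, the role names, and the sample rows are all hypothetical and do not correspond to any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Callable

Row = dict[str, object]


@dataclass
class TablePolicy:
    """Hypothetical policy object for one table (not a real product API)."""
    allowed_roles: set[str]
    # column name -> roles allowed to see the clear value
    masked_columns: dict[str, set[str]] = field(default_factory=dict)
    # role -> predicate deciding which rows that role may see
    row_filters: dict[str, Callable[[Row], bool]] = field(default_factory=dict)


def read_table(rows: list[Row], policy: TablePolicy, role: str) -> list[Row]:
    """Apply the role check, then row-level and column-level security."""
    if role not in policy.allowed_roles:
        raise PermissionError(f"role {role!r} may not read this table")
    keep = policy.row_filters.get(role, lambda row: True)
    result = []
    for row in rows:
        if not keep(row):
            continue  # row-level security: drop rows the role must not see
        visible = {}
        for col, value in row.items():
            unmasked_roles = policy.masked_columns.get(col)
            if unmasked_roles is not None and role not in unmasked_roles:
                visible[col] = "***"  # column-level security: mask the value
            else:
                visible[col] = value
        result.append(visible)
    return result


# Illustrative data and policy (assumptions, not taken from the article).
ORDERS = [
    {"order_id": 1, "region": "EU", "email": "a@example.com", "amount": 120.0},
    {"order_id": 2, "region": "US", "email": "b@example.com", "amount": 80.0},
]
POLICY = TablePolicy(
    allowed_roles={"analyst_eu", "admin"},
    masked_columns={"email": {"admin"}},
    row_filters={"analyst_eu": lambda row: row["region"] == "EU"},
)
```

Calling `read_table(ORDERS, POLICY, "analyst_eu")` returns only the EU row with the email masked, while `"admin"` sees both rows in full. In a real platform these rules live in the engine itself, so every query path enforces them.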
Polaris: Data Definition, Manipulation, and Access Control Service
Polaris falls under the first two categories, acting as a technical catalog for your data. Every data management system has its catalog, ranging from simple file lists to complex systems like Polaris, which tracks current and historical data partitions and changes.
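To illustrate what "tracking current and historical data partitions and changes" means in practice, here is a toy model of snapshot-based table metadata. It is a simplified sketch in the spirit of Iceberg-style catalogs, not the Polaris or Iceberg API: the class names, the integer timestamps, and the file paths are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int            # logical timestamp, an int for simplicity
    data_files: frozenset[str]   # full set of files visible in this snapshot


@dataclass
class TableMetadata:
    """Toy technical-catalog entry: every commit produces a new snapshot."""
    name: str
    snapshots: list[Snapshot] = field(default_factory=list)

    def commit(self, committed_at, added=(), removed=()):
        # Assumes commits arrive in increasing timestamp order.
        current = self.current_files()
        files = (current - frozenset(removed)) | frozenset(added)
        snapshot = Snapshot(len(self.snapshots) + 1, committed_at, files)
        self.snapshots.append(snapshot)
        return snapshot

    def current_files(self):
        return self.snapshots[-1].data_files if self.snapshots else frozenset()

    def files_as_of(self, timestamp):
        """Time travel: files visible in the latest snapshot at or before timestamp."""
        visible = [s for s in self.snapshots if s.committed_at <= timestamp]
        return visible[-1].data_files if visible else frozenset()


table = TableMetadata("sales")
table.commit(100, added=["month=2024-01/a.parquet"])
table.commit(200, added=["month=2024-02/b.parquet"])
# Rewrite the January partition: one file replaces another.
table.commit(300, added=["month=2024-01/a2.parquet"],
             removed=["month=2024-01/a.parquet"])
```

Because old snapshots are never modified, a reader can ask for `table.files_as_of(250)` and still see the January file that was later replaced. Any engine that understands the table format can read the same history.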
A unique feature of Polaris is its ability to delegate data manipulation to other catalogs. For instance, you can manage data across AWS Glue or Azure OneLake from a single place without relying on external services. Unlike traditional built-in catalogs, Polaris operates independently, allowing users to choose their preferred computing resources, such as Snowflake or Databricks. This flexibility minimizes overhead and optimizes computing and storage costs.
Polaris isn’t the only option. Unity, Glue Data Catalog, and Hive also support Iceberg tables, offering alternatives with varying strengths. The list of services is not final and will continue to grow. Snowflake claims that Polaris’s power is its efficiency, enabling cost reductions in computing and storage when using Snowflake. Further details can be found in this article on building an open data lakehouse and this piece on unifying Iceberg tables.
A key advantage of Polaris, similar to Unity, is its open-source nature and relative independence. This makes it easier to establish a unified interoperability layer for a full-blown Data Mesh architecture or its components with minimal effort. Strictly speaking, “independent” describes this advantage more precisely than “open-source”. However, this flexibility is limited to data processors that support Iceberg. Looking ahead, choosing incompatible solutions is becoming increasingly unlikely. Although ETL processes will not be eliminated entirely, data transfer and additional storage costs can be reduced significantly.
Data Catalogs for Asset Management
Business data catalogs focus on cataloging and exploration. They manage information that may include technical metadata (e.g., tables, files, lineage diagrams, and technical data quality), but they concentrate on treating data as assets. Objects in a data catalog need meaningful names and business descriptions to serve their purpose, so integrating a business glossary linked to technical metadata is an excellent way to make data more comprehensible to its users.
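A business glossary linked to technical metadata can be pictured as a small lookup structure. The sketch below is purely illustrative: `GlossaryTerm`, `search_assets`, `describe_column`, and the sample terms and column paths are hypothetical names chosen for the example, not part of any catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    name: str
    definition: str
    # Technical assets this term describes, written as "schema.table.column".
    linked_columns: list[str] = field(default_factory=list)


def search_assets(glossary, query):
    """Find technical columns by searching business names and definitions."""
    needle = query.lower()
    hits = []
    for term in glossary:
        if needle in term.name.lower() or needle in term.definition.lower():
            hits.extend(term.linked_columns)
    return sorted(set(hits))


def describe_column(glossary, column):
    """Reverse lookup: which business terms explain this column?"""
    return [term.name for term in glossary if column in term.linked_columns]


GLOSSARY = [
    GlossaryTerm(
        name="Net Sales",
        definition="Revenue after discounts and returns.",
        linked_columns=["marts.fact_sales.amount", "marts.fact_sales.discount"],
    ),
    GlossaryTerm(
        name="Customer",
        definition="A person or company that has placed at least one order.",
        linked_columns=["marts.dim_customer.customer_id"],
    ),
]
```

A business user can search for "revenue" and land on the right columns without knowing table names, and an engineer looking at `marts.fact_sales.discount` can see which business term depends on it.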
Normally, a business data catalog acts as a standalone or proprietary built-in service and has some specific components, such as a search engine, a GraphQL API, a well-developed attribute set, and classification capabilities.
The purpose often determines usage. For instance, using a Data Asset Management service tightly coupled with a specific Data Platform usually limits its ability to serve broader business needs, as assets are rarely confined to a single solution. Examples include Azure Purview and Snowflake Horizon. AWS Glue, however, falls short as a comprehensive Data Asset Management service.
The decision often boils down to vendor lock-in versus better integration. It might seem straightforward, but the trade-off isn’t always obvious. Balancing both worlds is often the best approach.
Let’s compare platform-specific data catalogs to independent ones and analyze their respective advantages. I’ll use Snowflake Horizon and Collate (OpenMetadata) as examples.
Snowflake Horizon
Initially introduced as a Snowflake-specific data catalog, this solution offers deep integration with Snowflake, enabling rapid onboarding of new features. Recently, support for the Polaris catalog was added, allowing it to be indexed and managed. When using Horizon with Polaris, features such as tags, column-level security, and other integrations become available.
A key advantage is the bidirectional connection: there are no boundaries between the catalog and the platform, so changes in the catalog are immediately reflected in platform behavior. It is simple, user-friendly, cost-effective, and easy to start with, so you can get value from the service quickly.
However, incorporating additional sources with Horizon is challenging:
Collate (OpenMetadata) offers an alternative. Originating at Uber, it evolved into an open-source solution, blending community-driven and proprietary features. This dual nature provides flexibility between enterprise support and startup-style innovation.
Key features include:
1. Data Exploration and Discovery
2. Data Quality
3. Platform Integration
4. AI-Driven Features
While Horizon provides overlapping features such as ERDs and glossary capabilities, it excels in Snowflake-specific functionalities like compute resource management and role control. Its AI capabilities are broader in areas like data classification, quality control, and access history.
Choosing between these solutions involves trade-offs, but picking just one isn’t necessary. Adding a dedicated layer for managing data assets is feasible. Horizon, in particular, provides robust compliance, privacy, and security features, making it a solid choice for a security layer. While maintaining dual systems requires additional effort for integration and user training, this approach can be implemented organically, especially since Horizon is built into the Snowflake ecosystem.
Planning and Function Mapping
Effective planning is essential when managing multiple tools. Below is a feature comparison to guide decision-making:
This list is not exhaustive and can be expanded based on specific needs. Certain features may transition to different levels as new data platform components or features are adopted.
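One lightweight way to keep such a mapping honest is to record which tool owns each function and check it for gaps and overlaps. The sketch below assumes a generic "platform catalog" and "independent catalog"; the feature assignments are illustrative placeholders for your own planning, not a vendor feature matrix.

```python
# Hypothetical function-to-tool mapping; adjust to your own stack.
FEATURE_OWNERS = {
    "tag-based policies": {"platform catalog"},
    "column-level security": {"platform catalog"},
    "business glossary": {"platform catalog", "independent catalog"},
    "data quality checks": {"independent catalog"},
    "cross-platform lineage": {"independent catalog"},
    "metric definitions": set(),  # nobody owns this yet
}


def plan_report(feature_owners):
    """Return functions with no owner (gaps) and with several owners (overlaps)."""
    gaps = sorted(f for f, owners in feature_owners.items() if not owners)
    overlaps = sorted(f for f, owners in feature_owners.items() if len(owners) > 1)
    return {"gaps": gaps, "overlaps": overlaps}


report = plan_report(FEATURE_OWNERS)
```

Gaps point to functions that still need a home (here, metric definitions, which a semantic layer could take over), while overlaps mark places where you must decide which tool is the source of truth.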
Semantic Layer and Its Role in Data Governance
The semantic layer is a class of service components that can complement or partially overlap with data catalogs' functions. It serves as an enhanced data catalog that focuses on data explanation and accessibility.
A simple semantic layer could be as basic as a data mart with defined foreign key relationships between fact and dimension tables. Acting as an intermediary layer, it describes the data and provides additional metadata. Historically, this layer was embedded within BI systems, but as organizations mature in their data governance practices, it is evolving into an independent component of the data stack.
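The "data mart with defined relationships" idea can be sketched with Python's built-in `sqlite3` module. The model below declares one metric, one dimension, and the join between a fact and a dimension table, then compiles a request into SQL. The dictionary layout, `compile_query`, and all table and column names are assumptions made for illustration; real semantic layers use their own modeling languages.

```python
import sqlite3

# Hypothetical semantic model: business names mapped to SQL expressions.
SEMANTIC_MODEL = {
    "fact_table": "fact_sales",
    "joins": {"dim_product": "fact_sales.product_id = dim_product.product_id"},
    "dimensions": {"category": "dim_product.category"},
    "metrics": {"net_sales": "SUM(fact_sales.amount - fact_sales.discount)"},
}


def compile_query(model, metric, dimension):
    """Turn a (metric, dimension) request into a SQL statement."""
    dim_expr = model["dimensions"][dimension]
    metric_expr = model["metrics"][metric]
    dim_table = dim_expr.split(".")[0]
    join_clause = ""
    if dim_table != model["fact_table"]:
        # The join condition comes from the model, not from the user.
        join_clause = f" JOIN {dim_table} ON {model['joins'][dim_table]}"
    return (
        f"SELECT {dim_expr} AS {dimension}, {metric_expr} AS {metric} "
        f"FROM {model['fact_table']}{join_clause} "
        f"GROUP BY {dim_expr} ORDER BY {dim_expr}"
    )


def demo_connection():
    """Build a tiny in-memory data mart with a foreign key relationship."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE fact_sales (
            sale_id INTEGER PRIMARY KEY,
            product_id INTEGER REFERENCES dim_product(product_id),
            amount REAL,
            discount REAL
        );
        INSERT INTO dim_product VALUES (1, 'books'), (2, 'toys');
        INSERT INTO fact_sales VALUES
            (1, 1, 20.0, 2.0), (2, 1, 10.0, 0.0), (3, 2, 15.0, 5.0);
    """)
    return conn


sql = compile_query(SEMANTIC_MODEL, "net_sales", "category")
rows = demo_connection().execute(sql).fetchall()
# rows == [('books', 28.0), ('toys', 10.0)]
```

The user asks for "net_sales by category" and never writes the join or the discount arithmetic. That logic is defined once in the model and reused by every consumer.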
Key features of a semantic layer include:
1. Metric Definitions
2. Dimension Hierarchies and Relationships
3. Business Glossary and Terminology
4. Proximity to Data
5. Improved Data Governance
6. Enhanced User Experience
By integrating directly with data and visualization tools, the semantic layer empowers non-technical users to interact with data more intuitively. For example:
Future Impact of the Semantic Layer
As the semantic layer evolves, it has the potential to reshape the data governance landscape significantly in several ways:
Integration with Data Catalogs
While the semantic layer and data catalogs share overlapping functionalities, they serve different purposes and can complement each other effectively:
By combining these tools, organizations can balance governance with accessibility. For instance, while the data catalog serves as a central repository for business and technical metadata, the semantic layer ensures this metadata is applied seamlessly to downstream tools.
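As a rough sketch of how catalog metadata might flow into a semantic layer, the snippet below enriches semantic field definitions with descriptions and sensitivity flags looked up from a catalog. The dictionaries, the `pii` tag, and the `enrich_fields` function are hypothetical; actual integrations depend on the APIs of the tools involved.

```python
# Hypothetical catalog export: column path -> business metadata.
CATALOG = {
    "marts.fact_sales.amount": {
        "description": "Gross sale amount in EUR",
        "tags": [],
    },
    "marts.dim_customer.email": {
        "description": "Customer contact email",
        "tags": ["pii"],
    },
}

# Hypothetical semantic-layer fields that reference catalog columns.
SEMANTIC_FIELDS = [
    {"name": "gross_amount", "source": "marts.fact_sales.amount"},
    {"name": "customer_email", "source": "marts.dim_customer.email"},
    {"name": "ad_hoc_flag", "source": "marts.fact_sales.flag"},  # not cataloged
]


def enrich_fields(fields, catalog):
    """Copy descriptions and sensitivity flags from the catalog onto each field."""
    enriched = []
    for item in fields:
        meta = catalog.get(item["source"], {})
        enriched.append({
            **item,
            "description": meta.get("description", ""),
            "sensitive": "pii" in meta.get("tags", []),
            "cataloged": item["source"] in catalog,
        })
    return enriched


enriched_fields = enrich_fields(SEMANTIC_FIELDS, CATALOG)
```

Downstream tools then see the same description and sensitivity flag that governance teams maintain in the catalog, and fields that are missing from the catalog stand out for review.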
Challenges and Considerations
In conclusion, the semantic layer represents a significant evolution in data governance, bridging usability, accessibility, and governance. By integrating semantic layers with data catalogs, organizations can develop a holistic approach to managing and maximizing the value of their data assets.
Examples of Semantic Layers
1. Looker LookML
Description: LookML is Looker’s proprietary modeling language that enables users to define metrics, relationships, and dimensions for use in Looker dashboards and reports.
Key Features: Centralized metric definitions for consistent reporting. Ability to define joins, hierarchies, and custom calculations. Integration with Looker’s visualization engine for seamless exploration.
Strengths: Deep integration with Looker and minimal overhead for users accessing data.
Limitations: Proprietary to Looker, creating challenges in multi-tool environments.
2. Tableau Semantic Layer
Description: Tableau’s semantic layer allows users to define relationships, hierarchies, and aggregations within Tableau dashboards.
Key Features: Support for calculated fields and custom hierarchies. Data-blending capabilities to combine data from multiple sources. Integration with Tableau Prep for data transformations.
Strengths: User-friendly and accessible for business users.
Limitations: Lacks broader governance capabilities found in standalone catalogs.
3. Microsoft Power BI Data Models
Description: Power BI includes a semantic modeling layer that allows users to define relationships between tables, create measures, and establish hierarchies.
Key Features: Tight integration with the Microsoft ecosystem (e.g., Azure, Excel). DAX (Data Analysis Expressions) for advanced metric creation. Role-based security at the data model level.
Strengths: Familiar interface for Microsoft users and excellent for self-service analytics.
Limitations: Best suited for Microsoft-heavy environments.
4. dbt Metrics
Description: dbt introduced metrics as part of its semantic layer, enabling teams to define reusable metric logic at the transformation layer.
Key Features: Metric definitions tied directly to transformation logic. Integration with tools like Snowflake and Looker for downstream use. Strong community support due to dbt’s open-source nature.
Strengths: Open and extensible, making it a good choice for modern data stacks.
Limitations: Requires technical expertise for setup and maintenance.
5. Cube
Description: A semantic layer that is agnostic to BI and DWH platforms and integrates with multiple BI tools.
Key Features: Integration with multiple data warehouses and BI tools. Available in both open-source and proprietary editions.
Strengths: Universal; can be used with multiple BI and AI tools.
Limitations: Overkill for a homogeneous DWH from a single vendor.
6. Snowflake (Cortex) semantic model
Description: Snowflake’s proprietary semantic layer.
Key Features: Metrics, dimensions, and relationship definitions, plus support for suggested questions and verified queries. Models can be generated and stored by Snowflake.
Strengths: Available in Snowflake out of the box; supported by Cortex NLQ and Cortex Search.
Limitations: Designed for AI use cases and not supported by BI tools. Can be used only in Snowflake.
Complementary Roles in Practice
Scenario: A retail company wants to govern its sales data across multiple teams and ensure consistent reporting.
Solution: Use a data catalog (e.g., Collibra) to manage metadata, ensure compliance, and enable data discovery. Employ a semantic layer (e.g., Looker LookML) to define metrics like “Net Sales” and make them available directly in reports.
2. Lineage and Impact Analysis
Scenario: A financial institution needs to track data lineage for regulatory purposes while empowering analysts with reliable metrics.
Solution: Use a data catalog (e.g., OpenMetadata) to track lineage from source systems to reports. Integrate a semantic layer (e.g., Tableau) to ensure metric definitions are consistent and accessible in dashboards.
3. Hybrid Integration
Scenario: A company using Snowflake for data storage and multiple BI tools for reporting wants centralized governance without sacrificing tool-specific functionality.
Solution: Use Snowflake Horizon as the central data catalog for governance and compliance. Leverage Tableau and Power BI semantic layers for tool-specific data exploration and reporting.
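The lineage and impact-analysis scenario above comes down to walking a dependency graph. The sketch below stores lineage as an adjacency map and uses a breadth-first search to list everything downstream of a changed asset. The asset names and the `downstream` function are invented for illustration; real catalogs expose lineage through their own APIs.

```python
from collections import deque

# Hypothetical lineage edges: asset -> assets that read from it.
LINEAGE = {
    "crm.orders": ["staging.orders"],
    "staging.orders": ["marts.fact_sales"],
    "marts.fact_sales": ["metric.net_sales", "report.weekly_sales"],
    "metric.net_sales": ["dashboard.exec_overview"],
}


def downstream(lineage, start):
    """Breadth-first walk returning every asset affected by a change to start."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guard against revisiting shared assets
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream(LINEAGE, "staging.orders")
```

Here a change to `staging.orders` reaches the fact table, the `net_sales` metric defined in the semantic layer, and finally the reports and dashboards. That is the chain an auditor or an analyst needs to see before a change is approved.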
The Best Approach: A Layered Strategy
Rather than choosing between Horizon and an independent catalog, organizations can benefit from using both strategically.
As you can see, there is no one-size-fits-all solution. By leveraging the advantages of a variety of tools and services, you can achieve what is best for your business now and in the future.
Originally published here.