Making Sense of Unity Catalog in Databricks

After traveling around talking about Databricks, I’ve realized that most people out there aren’t really that interested in or ready for the newest, most complex use cases or features. A lot of them are still wondering if it’s worth upgrading from Databricks Standard to Premium or maybe thinking Unity Catalog is just a way to see some extra info about their data.

The goal of this article is to explain what Unity Catalog actually does and why I think it’s more than just a catalog.

First, let’s look at a common definition of a data catalog:

“A data catalog is a collection of metadata, combined with data management and search tools, that help data users find the data that they need.”

In an episode of NextGenLakehouse, Michelle talks about how catalogs are evolving from “just” containing metadata and search to something that also covers things like commits, extended governance, and even data sharing.


Why Unity Catalog? The Shift from Hive Metastore


How managing governance has gone from a workspace-by-workspace approach to a more centralized one.


Unity Catalog is essentially the evolution of the legacy Hive Metastore in Databricks, and many businesses are dealing with technical debt because they started on the Hive Metastore and haven't yet migrated to Unity Catalog.

The biggest difference between the two is the extended governance features. With Hive Metastore, governance was managed on a workspace-by-workspace basis. This meant that each workspace had to set its own permissions, manage access controls, and handle user management separately, making things fragmented and difficult to scale. It also made it challenging to have visibility and control across different workspaces within the same account.

In contrast, Unity Catalog centralizes governance. Instead of managing permissions in each workspace individually, you manage them at the Metastore level (a grouping of multiple workspaces), ensuring that access control and security policies are consistent across all workspaces that share the same Metastore. This centralized model makes it easier to ensure consistent governance, security, and compliance across the entire environment.


Centralized Governance & Fine-Grained Access Control


An example of how workspaces and catalogs can be connected.

The core of Unity Catalog's governance is its ability to centralize control, which is great when you're dealing with multiple workspaces, clouds, and systems. Before Unity Catalog, permissions were tied to individual workspaces. Now, permissions are applied directly to the data, so different workspaces can access the same data while still having control over what they can see or do with it.

Granular Permissions: Unity Catalog lets you set permissions at a granular level, on catalogs, schemas, tables, and even individual columns. You can also define column masking to control which users can see sensitive data (a small sketch follows after this list).

Easy Overview: Unity Catalog gives an easy and intuitive way to see what permissions are set on the different data assets within the catalogs.

Cross-Workspace Governance: Security and access control policies are consistent across all workspaces that share the same Metastore, making governance easier to manage, even in multi-cloud setups.
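
To make the granular permissions and column masking above a bit more concrete, here is a minimal sketch in a Databricks notebook. The catalog, schema, table, and group names are hypothetical, and it assumes you hold the privileges needed to grant access and attach masks.

```python
# Minimal sketch: Unity Catalog grants and a column mask.
# `sales`, `crm`, `customers`, `data_analysts` and `pii_readers` are
# hypothetical names; in a Databricks notebook `spark` already exists.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Grant access at catalog, schema and table level to a group.
spark.sql("GRANT USE CATALOG ON CATALOG sales TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA sales.crm TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE sales.crm.customers TO `data_analysts`")

# Define a masking function and attach it to a sensitive column, so only
# members of `pii_readers` see the real value.
spark.sql("""
    CREATE OR REPLACE FUNCTION sales.crm.mask_email(email STRING)
    RETURN CASE
        WHEN is_account_group_member('pii_readers') THEN email
        ELSE '***'
    END
""")
spark.sql("ALTER TABLE sales.crm.customers ALTER COLUMN email SET MASK sales.crm.mask_email")
```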


Unified Metadata & Lineage Tracking

As your data grows, managing metadata can become a real challenge. Unity Catalog makes it easier by centralizing metadata storage and giving you more context about your data.

You can search across the data in your catalogs in an intuitive way, and Unity Catalog gives you a deeper look into each table through the Insights, Details, and History panes.

  • Insights Pane: See things like how often a table is queried, what the most popular queries are, what tables are commonly joined, and who interacts with the table the most.
  • Details Pane: Find more specific info, like the table ID, ownership, whether it’s managed or external, when it was created, and more.
  • History Pane: Reads transaction logs for you and shows the changes in a simple, easy-to-understand table. You can see who made what changes, when they happened, and even roll back to a previous version with just a few clicks.
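
The History pane is backed by the Delta transaction log, and the same information is available in code. A minimal sketch, using a hypothetical table name and assuming the version you restore to actually exists:

```python
# Minimal sketch: the Delta history behind the History pane, in code.
# `sales.crm.customers` and version 5 are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Who changed the table, when, and with which operation.
spark.sql("DESCRIBE HISTORY sales.crm.customers").show(truncate=False)

# Roll the table back to an earlier version.
spark.sql("RESTORE TABLE sales.crm.customers TO VERSION AS OF 5")
```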

On top of metadata, Unity Catalog also provides built-in lineage tracking, allowing you to visualize and track the entire data flow across your environment.

  • Track lineage at multiple levels, from entire datasets to individual columns.
  • See how data is transformed through tables, views, and models.
  • Visualize the lineage as a graph to better understand how data moves across workflows, notebooks, and jobs.
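
Besides the graph in the UI, lineage is also exposed as queryable system tables. A hedged sketch, assuming the lineage system tables are enabled in your workspace and using a hypothetical table name:

```python
# Hedged sketch: reading table-level lineage from Unity Catalog's
# system tables instead of the UI graph. The target table name is a
# hypothetical example; check the system.access schema in your
# workspace for the exact columns available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT source_table_full_name, target_table_full_name, event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'sales.crm.customers'
    ORDER BY event_time DESC
""").show(truncate=False)
```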


An example of a lineage graph in Databricks.

Managing & Sharing Data Across Workspaces, External Sources, and Integrations


When using external sources, your data stays in your cloud storage (for example Azure), not in Databricks.

Unity Catalog allows you to manage data across multiple workspaces and external sources, such as AWS S3, Azure ADLS, and other cloud platforms, without having to move your data into Databricks. Your data remains securely stored in your cloud environment, ensuring compliance with your data governance policies while enabling easy access and sharing.
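
As a rough sketch of what this looks like, the snippet below registers an Azure Data Lake Storage container as an external location and puts an external table on top of it. The storage account, credential, catalog, and paths are hypothetical, the storage credential is assumed to have been created by an admin, and the path is assumed to already contain Delta files.

```python
# Rough sketch: governing data that stays in Azure storage.
# Storage account, credential, catalog and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register a container in your own ADLS account with Unity Catalog,
# using a storage credential an admin has already set up.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS raw_landing
    URL 'abfss://raw@mystorageaccount.dfs.core.windows.net/landing'
    WITH (STORAGE CREDENTIAL azure_mi_credential)
""")

# Create an external table whose files stay in ADLS, not in Databricks
# (the path is assumed to already hold a Delta table).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.crm.raw_orders
    USING DELTA
    LOCATION 'abfss://raw@mystorageaccount.dfs.core.windows.net/landing/orders'
""")
```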

Additionally, Unity Catalog integrates with popular data catalogs and governance tools like Immuta, Collibra, Atlan, and Microsoft Purview, helping you build a robust governance model without having to get rid of existing catalogs.

Within Databricks, workspaces can share data through Unity Catalog, applying different levels of access control as needed—whether read-only or full access. This flexibility enables easier collaboration across teams and environments without duplicating data.


Advanced Data Sharing (Delta Sharing & Clean Rooms)

Unity Catalog is a requirement for enabling Delta Sharing, which lets you securely share live data with other organizations, even if they’re not using Databricks. Instead of making static copies, you can grant direct access while keeping full control over permissions.

Inside Unity Catalog, you can manage Delta Sharing, set up shares, and define who can access what data. This is also where you interact with Clean Rooms, allowing multiple parties to collaborate on datasets securely without exposing raw data.
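
A minimal sketch of that flow, with hypothetical share, table, and recipient names, and assuming you have the required sharing privileges on the metastore:

```python
# Minimal sketch: creating a Delta Share and granting a recipient access.
# `partner_share`, `sales.crm.customers` and `acme_corp` are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a share and add a table to it.
spark.sql("CREATE SHARE IF NOT EXISTS partner_share")
spark.sql("ALTER SHARE partner_share ADD TABLE sales.crm.customers")

# Create a recipient and grant them read access to the share; with open
# sharing the recipient activates a credential file to connect.
spark.sql("CREATE RECIPIENT IF NOT EXISTS acme_corp")
spark.sql("GRANT SELECT ON SHARE partner_share TO RECIPIENT acme_corp")
```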

Databricks also provides a Marketplace, where you can list or sell data assets, making it easier to share data externally. More detailed information on how data can be shared between workspaces, metastores, or across clouds can be found in one of my previous articles: Data Sharing in Databricks: An Introduction.


Support for ML Models

Unity Catalog integrates with MLflow, enabling it to serve as a unified model registry. This integration allows you to apply the same governance policies to both data and machine learning models (a small sketch follows after the list below).

Key benefits include:

  • Unified Governance: Consistent access control and security policies for both data and models.
  • Model Versioning: Track and manage multiple versions of models.
  • Auditability: Full visibility into who created, modified, and accessed models.
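
Here is a hedged sketch of what registering a model in Unity Catalog can look like with MLflow. The three-level model name and the training code are illustrative only; it assumes a Databricks ML runtime (or mlflow and scikit-learn installed) and a workspace with Unity Catalog enabled.

```python
# Hedged sketch: Unity Catalog as the MLflow model registry.
# The catalog/schema/model name "ml.default.iris_classifier" is a
# hypothetical example.
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Point MLflow's registry at Unity Catalog instead of the legacy
# workspace model registry.
mlflow.set_registry_uri("databricks-uc")

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    # Log and register under a three-level name (catalog.schema.model);
    # the input_example lets MLflow infer the signature that Unity
    # Catalog expects for registered models.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="ml.default.iris_classifier",
        input_example=X[:5],
    )
```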


Multi-Cloud Support


Unity Catalog makes it easier to manage data when it’s spread across different clouds. Whether your data is in AWS, Azure, or GCP, Unity Catalog helps you apply the same governance rules everywhere.

  • Multi-Cloud Governance: Apply a single set of governance rules across different clouds, ensuring consistent security and access control.
  • Lakehouse Federation: Connect to and query data that lives in other platforms and clouds without moving it, so it can be queried and shared under the same governance model (sketched below).
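
As a hedged illustration of Lakehouse Federation, the sketch below registers an external PostgreSQL database as a foreign catalog so it can be queried alongside native Unity Catalog catalogs. The host, secret scope and keys, and catalog and table names are all hypothetical.

```python
# Hedged sketch: Lakehouse Federation for an external PostgreSQL database.
# Host, secret scope/keys, catalog and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Store the connection once; credentials come from Databricks secrets.
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS pg_conn TYPE postgresql
    OPTIONS (
        host 'pg.example.com',
        port '5432',
        user secret('federation', 'pg_user'),
        password secret('federation', 'pg_password')
    )
""")

# Expose the external database as a catalog that sits next to native
# Unity Catalog catalogs and uses the same GRANT model.
spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS pg_sales
    USING CONNECTION pg_conn
    OPTIONS (database 'sales')
""")

# Query it like any other catalog.
spark.sql("SELECT * FROM pg_sales.public.orders LIMIT 10").show()
```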

For organizations using multiple clouds, Unity Catalog simplifies data governance, ensuring security and compliance across all environments.



If you’re on the fence: in my opinion, Unity Catalog is definitely a no-brainer if you are using or considering Databricks. I would go as far as saying that Unity Catalog is a great reason to start using Databricks in your business.

#Databricks #DatabricksMVP #UnityCatalog #Governance #DeltaSharing #AzureDatabricks

Comments

Matthias Ingerfeld (Area VP & TechGM Field Engineering for Central EMEA at Databricks):
Great overview! Julia Førde

Daniel Söderström (Databricks | The Data Intelligence Platform):
Insightful as always Julia. Thank you!

Kumaravel Vivekanandam (Technology, Data & Analytics Executive):
These capabilities make Databricks unique, along with its AI/ML capabilities. Data lineage within the Lakehouse or the Data Lake is a good feature, but it doesn't provide the capability to trace back to the source; an integration with source systems would be a welcome feature.
