Spotlight on Unity Catalog

AI is a whole new world, and there’s a whole new dictionary to go with it. To read my future articles, join my network by clicking 'Follow'.

----------------------------------------------------------------------------------

More raw notes this week. Last week's blog was a spotlight on LangChain. This week, we're going to dive into Databricks Unity Catalog.

In my personal opinion, Unity Catalog is currently the most underrated product at Databricks. But it's actually one of the most important products. Why is it important? Many reasons: compliance, regulations, and responsible AI.

And, did you know that this wonderful product comes FREE with your Databricks instance?

It starts with Data Governance....

What is data governance?

Think about the company where you work. Many different people work there, such as managers, engineers, and salespeople. The company has records for each employee. The records include hiring date, salary, personal info, and performance reviews. That's a lot of sensitive information, right?

This is the reason that we have data governance.

Data Governance is like a set of rules and processes that ensure proper management of data. In this example above, we're looking at employee data. This is important and sensitive data. And we want to make sure that:

  1. Everything is Up-to-Date and Correct: Imagine if someone's salary info was wrong in the company's system - that would be a big problem! Data governance includes checking that the details in the HR records are correct. It also includes updating them when things change, like if an employee gets a promotion or raise.
  2. The Right People Have Access: Only certain employees in the company should be able to access HR records. For example, your manager can view your performance review, but your coworker cannot. Data governance sets up rules so that, for example, the IT guy can’t snoop around in an employee’s performance evaluations.
  3. The Information is Safe: HR records need to be protected from hackers or anyone who shouldn't see them.
  4. Laws and Rules are Followed: There are laws governing how employee information is used and protected, especially personal or sensitive data. Data governance makes sure the company follows these laws to avoid lawsuits or fines.

But Data Governance is only one third of the equation.

What is Unity Catalog?

So I just explained why the governance of data is important. But what about the other stuff? What about governance for analytics? What about governance for AI models?

Unity Catalog is a governance platform that helps companies govern all their data and AI in one place.

When I write “all”, I mean everything. This means structured and unstructured data, notebooks, dashboards, files and even AI models. And it doesn’t matter where they are stored. Unity Catalog works on Lakehouses across all Cloud platforms.

You can control who accesses the data and/or the AI, track its usage, and find it easily in your Databricks workspaces.

Unity Catalog offers the following key features:

  • Unified Governance. This means centralized access control. Set up security access rules in one spot, and use them everywhere.
  • Consistency Across Environments. Makes sure rules are the same across all areas of use.
  • Fine-grained access control. Organizations can decide who can access specific parts of their data, like catalogs, tables, or views. This helps make sure that only the right people can access sensitive information.
  • Integration with Open Standards. Works with various data formats, like Apache Iceberg, for flexibility.
  • Standards-compliant security model. Use a security model that follows standards (like ANSI SQL) to decide who can see or use different parts of the data. You can get granular when granting permissions. You can set them at the catalog level or the table level, for example.
  • Built-in auditing and lineage. Keep automatic records of who did what with the data and where the data comes from. Have an audit log of what each user has done.
  • Data search and discovery. Make data easy to find by labeling and describing it. Plus, a search tool to help locate it quickly.
  • System tables. Check your operational data using system tables. This includes audit logs and usage details for compliance.
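To make the "catalog level or table level" idea concrete, here is a minimal sketch in Python that builds the kind of ANSI-style GRANT statements Unity Catalog accepts. The catalog, schema, table, and group names are hypothetical, and the helper functions are my own illustration, not part of any Databricks library. It assumes a group needs USE CATALOG and USE SCHEMA on the parents before SELECT on the table does anything useful.

```python
import re

# Plain identifiers only, so nothing unexpected ends up inside the SQL text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(*names: str) -> None:
    """Reject anything that is not a simple identifier."""
    for name in names:
        if not _IDENT.match(name):
            raise ValueError(f"unsupported identifier: {name!r}")


def read_only_grants(catalog: str, schema: str, table: str, group: str) -> list[str]:
    """Return the GRANT statements that let `group` read exactly one table."""
    _check(catalog, schema, table, group)
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{group}`",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{group}`",
    ]


# Hypothetical names, for illustration only.
statements = read_only_grants("hr", "reviews", "performance", "hr_managers")
for statement in statements:
    print(statement)  # in a Databricks notebook you would run spark.sql(statement)
```

In this sketch the managers group can read the performance table and nothing else in the catalog, which is the HR example from earlier: your manager can see your review, the IT guy cannot.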

Unity Catalog for AI Models

Unity Catalog also provides features for governing AI models. These features keep track of AI models and make sure they’re used in the right way.

  • Model registry. Unity Catalog provides a central place to register and manage AI models. Organizations use this to track all their AI models. Scientists and analysts have an easier time finding and using the models they need.
  • Model monitoring. Use Unity Catalog to monitor the performance of AI models in production. Organizations can find and fix AI model problems before any damage occurs. Unity Catalog can track a variety of model metrics, including accuracy, precision, recall, and F1 score. It can also monitor model drift, which is the change in a model's performance over time.
  • Model governance. Use Unity Catalog to follow best practices for model governance. For example, you can use it to review and approve models. Organizations can use this tool to ensure ethical development and deployment of AI models.
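Here is a rough sketch of what registering a model could look like. It assumes models live in the same three-level namespace as tables (catalog.schema.model) and that you register through MLflow with the registry pointed at Unity Catalog. The article above doesn't mention MLflow, so treat that part as my assumption. The names are hypothetical.

```python
def uc_model_name(catalog: str, schema: str, model: str) -> str:
    """Join the three namespace levels into catalog.schema.model."""
    parts = (catalog, schema, model)
    if not all(part and "." not in part for part in parts):
        raise ValueError("each level must be non-empty and contain no dots")
    return ".".join(parts)


def register_in_unity_catalog(model_uri: str, full_name: str):
    """Register an already-logged MLflow model under a Unity Catalog name.

    Assumption: MLflow is installed and the code runs where Databricks
    credentials are available. This function is not called in this sketch.
    """
    import mlflow  # imported here so the helper above works without MLflow

    mlflow.set_registry_uri("databricks-uc")  # point the registry at Unity Catalog
    return mlflow.register_model(model_uri=model_uri, name=full_name)


# Hypothetical names, for illustration only.
name = uc_model_name("ml", "churn", "churn_classifier")
print(name)
```

Because the model gets a catalog-style name, the same kind of permissions you set on tables can, in principle, apply to it.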

Unity Catalog and Responsible AI

So it seems like everyone's jumping on the big LLM (Large Language Model) bandwagon these days, right? It's like the trendy new gadget that all the cool kids want. But here's the kicker: just because we can whip up an AI that chats like your best buddy, doesn't mean we should let it loose without some ground rules. That's where responsible AI comes into play.

Think of it as the chaperone at the high school dance, making sure AI doesn't step on any toes or spill punch on the prom queen. It's all about making sure our AI pals are fair, aren't tripping over their own algorithms, and keep everyone's secrets... well, secret. Because nobody wants a blabbermouth robot that spills the beans on your secret love for karaoke. So, yes, responsible AI is the name of the game if we want to keep things cool, safe, and fair in AI-land.

Here are some examples of how Unity Catalog makes AI safer and fairer:

  • Tracing data’s roots (lineage). Unity Catalog keeps track of where data comes from, including its history and journey. Knowing that, teams can restrict access to sensitive data. They can also make sure data is used only for its intended purpose, which helps AI make fair and unbiased decisions.
  • Improving data quality and reducing bias. It helps find and fix issues in the data that could make AI biased, which helps make AI fairer and more trustworthy.
  • Watching over AI models. It checks on AI models to make sure they’re working as expected. This helps avoid problems like biases changing over time or other unexpected outcomes.
  • Making AI use clear and responsible. It provides a central place to see who uses what data and for what, which helps ensure AI is used responsibly. It can create reports on data access and how AI models are used.

Is Unity Catalog the Holy Grail?

Yes, in my opinion, Unity Catalog is the Holy Grail of end-to-end lineage.

In my world, the Holy Grail looks like a martini glass.

End-to-end lineage means that it can track the flow of data from its source to its destination. This includes all intermediate transformations and processing steps. It can also track the lineage of notebooks, workflows, and dashboards. You can also see how data is being used across the entire data science lifecycle.

Mind blown!

Let me illustrate how this can work. A data scientist is building an ML model to predict customer churn. They start by using Unity Catalog to identify the data sources that they need. For example, customer data, product data, and usage data.

They then import this data into a Databricks workspace. They create a notebook to clean and prepare the data. The data scientist then tracks the lineage of the data as it is transformed and processed. This includes tracking the following:

  • The source of the data
  • Any transformations that are applied to the data
  • The intermediate datasets that are created
  • The final dataset that is used to train the ML model

Once the data is ready, the data scientist uses Unity Catalog to generate a lineage report for the ML model. This report shows the entire data flow, from the source data to the trained model.

The data scientist uses the lineage report to understand how the model is working. They can identify any potential problems. If the model is not performing well, the lineage report can track down the source of the problem. Did an error come from the original source of the data? Or did it come from a notebook? Or did it come from an ML model?
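To show the idea behind that "track down the source" step, here is a toy sketch. It is not how Unity Catalog works internally. It just treats lineage as a list of source-to-target edges and walks backwards from the model. All table and model names are made up for the churn example.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (source -> target) for the churn example.
EDGES = [
    ("raw.customers", "prep.churn_features"),
    ("raw.products", "prep.churn_features"),
    ("raw.usage", "prep.usage_daily"),
    ("prep.usage_daily", "prep.churn_features"),
    ("prep.churn_features", "ml.churn_model"),
]


def upstream(target: str, edges: list[tuple[str, str]]) -> list[str]:
    """Return every asset that feeds `target`, nearest first (breadth-first)."""
    parents = defaultdict(list)
    for source, dest in edges:
        parents[dest].append(source)

    seen, order, queue = {target}, [], deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = upstream("ml.churn_model", EDGES)
print(lineage)
```

If the churn model misbehaves, this list is the set of suspects: the feature table first, then the raw tables behind it.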

Unity Catalog's lineage data is stored in a Delta table in the UC metastore. This Delta table stores the full history of recent lineage records and is updated in near real time. Additionally, customers can query it through the standard SQL interface.
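As a sketch of what "query it through SQL" might look like, the Python below builds a query against the lineage system table. I'm assuming system tables are enabled in your workspace and that lineage is exposed as system.access.table_lineage with the column names shown. Please check those against your own environment. The target table name is hypothetical.

```python
import re

# Expect a three-level name: catalog.schema.table
_FULL_NAME = re.compile(r"^[A-Za-z_]\w*\.[A-Za-z_]\w*\.[A-Za-z_]\w*$")


def lineage_query(target_table: str, days: int = 30) -> str:
    """Build a query listing the tables that recently fed `target_table`."""
    if not _FULL_NAME.match(target_table):
        raise ValueError("expected a catalog.schema.table name")
    return (
        "SELECT source_table_full_name, entity_type, event_time\n"
        "FROM system.access.table_lineage\n"  # assumed system table name
        f"WHERE target_table_full_name = '{target_table}'\n"
        f"  AND event_date >= date_sub(current_date(), {int(days)})\n"
        "ORDER BY event_time DESC"
    )


# Hypothetical table name, for illustration only.
query = lineage_query("prep.churn.features", days=7)
print(query)  # in a Databricks notebook you would run spark.sql(query)
```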

End-to-end lineage is important for a number of reasons. It can help data teams to:

  • Identify the sources of data quality issues. If a data quality issue is detected in a downstream table, end-to-end lineage can track back to the source of the issue. This can help to identify the root cause of the problem and take corrective action.
  • Audit data usage. End-to-end lineage can be used to track who has accessed and used which data, and when. This can be helpful for compliance and auditing purposes.
  • Support data governance. End-to-end lineage can be used to identify sensitive data and ensure that it is protected.
  • Improve data observability. End-to-end lineage can be used to track the flow of data through a data pipeline. It can identify any potential bottlenecks or errors.


About the author: Maria Pere-Perez

The opinions expressed in this article are my own. This includes the use of analogies, humor and occasional swear words. I currently work as the Director of ISV Technology Partnerships at Databricks. However, this newsletter is my own. Databricks did not ask me to write this. And they do not edit any of my personal work. My role at Databricks is to manage partnerships with AI companies, such as Dataiku, Pinecone, LangChain, Posit, MathWorks, and Plotly. In this job, I'm exposed to a lot of new words and concepts. I started writing down new words in my diary. And then I thought I’d share them with people. Click "Subscribe" at the top of this blog to learn new words with me each week.
