Spotlight on Unity Catalog
Maria Pere-Perez
Databricks | Sr Director, AI Technology Partners | LinkedIn Top Voice
AI is a whole new world, and there’s a whole new dictionary to go with it. To read my future articles, join my network by clicking 'Follow'.
----------------------------------------------------------------------------------
More raw notes this week. Last week's blog was a spotlight on LangChain. This week, we're going to dive into Databricks Unity Catalog.
In my personal opinion, Unity Catalog is currently the most underrated product at Databricks. But it's actually one of the most important products. Why is it important? Many reasons: compliance, regulations, and responsible AI.
And, did you know that this wonderful product comes FREE with your Databricks instance?
It starts with Data Governance....
What is data governance?
Think about the company where you work. Many different people work there, such as managers, engineers, and salespeople. The company has records for each employee. The records include hiring date, salary, personal info, and performance reviews. That's a lot of sensitive information, right?
This is the reason that we have data governance.
Data governance is a set of rules and processes that ensure data is managed properly. In the example above, we're looking at employee data. This is important and sensitive data. And we want to make sure that:
But Data Governance is only one third of the equation.
What is Unity Catalog?
So I just explained why the governance of data is important. But what about the other stuff? What about governance for analytics? What about governance for AI models?
Unity Catalog is a governance platform that helps companies govern all their data and AI in one place.
When I write “all”, I mean everything. This means structured and unstructured data, notebooks, dashboards, files, and even AI models. And it doesn’t matter where they are stored. Unity Catalog works on lakehouses across all cloud platforms.
You can control who accesses the data and/or the AI, track its usage, and find it easily in your Databricks workspaces.
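To make the access-control idea concrete, here is a minimal sketch of how a SQL `GRANT` statement for a table could be assembled. The table name `main.hr.employee_records`, the group `hr-managers`, and the helper `build_grant` are all hypothetical, invented for illustration. The sketch only builds the statement text. In a Databricks notebook you would then pass it to something like `spark.sql(statement)`.

```python
# Hypothetical names throughout: the catalog, schema, table, and group below
# are invented for illustration and do not come from the article.

ALLOWED_PRIVILEGES = {"SELECT", "MODIFY"}


def build_grant(privilege: str, table: str, principal: str) -> str:
    """Return a GRANT statement giving `principal` one privilege on a table."""
    privilege = privilege.upper()
    if privilege not in ALLOWED_PRIVILEGES:
        raise ValueError(f"unsupported privilege: {privilege}")
    # Backticks quote the principal so names with dashes or spaces still parse.
    return f"GRANT {privilege} ON TABLE {table} TO `{principal}`"


statement = build_grant("select", "main.hr.employee_records", "hr-managers")
print(statement)
```

Restricting the helper to a short allow-list of privileges is a design choice for this sketch, so a typo cannot turn into a broader grant than intended.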
Unity Catalog offers the following key features:
Unity Catalog for AI Models
Unity Catalog also provides features for governing AI models. These features keep track of AI models and make sure they’re used in the right way.
Unity Catalog and Responsible AI
So it seems like everyone's jumping on the big LLM (Large Language Model) bandwagon these days, right? It's like the trendy new gadget that all the cool kids want. But here's the kicker: just because we can whip up an AI that chats like your best buddy, doesn't mean we should let it loose without some ground rules. That's where responsible AI comes into play.
Think of it as the chaperone at the high school dance, making sure AI doesn't step on any toes or spill punch on the prom queen. It's all about making sure our AI pals are fair, aren't tripping over their own algorithms, and are keeping everyone's secrets... well, secret. Because nobody wants a blabbermouth robot that spills the beans on your secret love for karaoke. So, yes, responsible AI is the name of the game if we want to keep things cool, safe, and fair in AI-land.
Here are some examples of how Unity Catalog makes AI safer and fairer:
Is Unity Catalog the Holy Grail?
Yes, in my opinion, Unity Catalog is the Holy Grail of end-to-end lineage.
End-to-end lineage means that it can track the flow of data from its source to its destination. This includes all intermediate transformations and processing steps. It can also track the lineage of notebooks, workflows, and dashboards. You can also see how data is being used across the entire data science lifecycle.
Mind blown!
Let me illustrate how this can work. A data scientist is building an ML model to predict customer churn. They start by using Unity Catalog to identify the data sources that they need, such as customer data, product data, and usage data.
They then import this data into a Databricks workspace. They create a notebook to clean and prepare the data. The data scientist then tracks the lineage of the data as it is transformed and processed. This includes tracking the following:
Once the data is ready, the data scientist uses Unity Catalog to generate a lineage report for the ML model. This report shows the entire data flow, from the source data to the trained model.
The data scientist uses the lineage report to understand how the model is working and to identify any potential problems. If the model is not performing well, the lineage report can help track down the source of the problem. Did an error come from the original source of the data? Or did it come from a notebook? Or did it come from an ML model?
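The "track down the source of the problem" step is basically a walk backwards through a lineage graph. Here is a toy sketch of that idea for the churn example. The asset names and the `trace_upstream` helper are hypothetical, and this is not how Unity Catalog stores or exposes lineage. It only illustrates the concept.

```python
from collections import deque

# Toy lineage graph: each asset maps to the assets it was built from.
# All names are hypothetical; this is not Unity Catalog's internal format.
UPSTREAM = {
    "churn_model": ["churn_features"],
    "churn_features": ["prep_notebook"],
    "prep_notebook": ["customer_data", "product_data", "usage_data"],
    "customer_data": [],
    "product_data": [],
    "usage_data": [],
}


def trace_upstream(asset: str, graph: dict[str, list[str]]) -> list[str]:
    """Return every asset that feeds into `asset`, nearest first (breadth-first)."""
    seen = set()
    order = []
    queue = deque(graph.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        order.append(current)
        queue.extend(graph.get(current, []))
    return order


suspects = trace_upstream("churn_model", UPSTREAM)
print(suspects)
```

The list comes back nearest-first, so you would check the feature table and the notebook before going all the way back to the raw sources.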
Unity Catalog's lineage data is stored in a Delta table in the UC metastore. This Delta table stores the full history of recent lineage records and is updated in near real time. Additionally, customers can query it through the standard SQL interface.
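Since lineage can be queried with SQL, here is a rough sketch of what such a query might look like. It is built as a string in Python. I'm assuming the lineage table is exposed as `system.access.table_lineage` with columns `source_table_full_name`, `target_table_full_name`, and `event_time`. Please check the names in your own workspace before relying on them. The `lineage_query` helper and the example table name are hypothetical.

```python
def lineage_query(target_table: str, days: int = 7) -> str:
    """Build a SQL query listing the tables that recently fed `target_table`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    # Escape single quotes so the table name is safe inside a SQL string literal.
    safe_name = target_table.replace("'", "''")
    return (
        "SELECT source_table_full_name, target_table_full_name, event_time\n"
        # Assumed table and column names; verify them in your workspace.
        "FROM system.access.table_lineage\n"
        f"WHERE target_table_full_name = '{safe_name}'\n"
        f"  AND event_time >= current_timestamp() - INTERVAL {days} DAYS\n"
        "ORDER BY event_time DESC"
    )


query = lineage_query("main.ml.churn_features", days=30)
print(query)
```

In a notebook, the resulting text could be run with something like `spark.sql(query)` or pasted into the SQL editor.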
End-to-end lineage is important for a number of reasons. It can help data teams to:
About the author: Maria Pere-Perez
The opinions expressed in this article are my own. This includes the use of analogies, humor and occasional swear words. I currently work as the Director of ISV Technology Partnerships at Databricks. However, this newsletter is my own. Databricks did not ask me to write this. And they do not edit any of my personal work. My role at Databricks is to manage partnerships with AI companies, such as Dataiku, Pinecone, LangChain, Posit, MathWorks, Plotly, etc... In this job, I'm exposed to a lot of new words and concepts. I started writing down new words in my diary. And then I thought I’d share it with people. Click "Subscribe" at the top of this blog to learn new words with me each week.
Strategic Partnerships and GTM leader, Broadcom Software | Ex VMware, Yahoo!, Oracle, Sun Micro
3 months ago: Saw the demo at Databricks Conf. This is truly useful end-to-end. Use cases can be mind blowing indeed.
Marketing Operations Manager @ Blueprint | CAPM Certified
8 months ago: We agree! Check out our upcoming webinar on Unity Catalog: https://www.dhirubhai.net/events/7156023634510106624/about/
Marketing Operations Manager @ Blueprint | CAPM Certified
10 months ago: Such a great explanation of UC! Thanks, Maria!
Former General Manager-Area Vice President Databricks Federal, LLC at Databricks
1 year ago: Great job, Maria. Thanks for sharing.