Data Sharing in Databricks: An Introduction
Data sharing has always been important, but traditionally, it’s been a one-way street—either you send data out, or you receive it. On top of that, there are challenges with security, ensuring everyone is on the same page with data formats, and managing all the manual processes. While many platforms offer data-sharing capabilities, I think Databricks stands out by making collaboration easier, more flexible, and secure—whether it’s within teams, companies, or across clouds.
With all these new possibilities for sharing data, it’s important to take a step back and think about how your organization should be sharing data. It’s not just about the tools; you also need to follow your company’s guidelines and best practices, especially when sharing data outside your organization. In this article, I’ll walk you through Databricks’ main data-sharing features, how they work, and when you should use them. Keep in mind that this reflects my own understanding and personal opinions.
Setup: Prerequisites for Data Sharing
Before you start using any of the data-sharing features in Databricks, there are a couple of things you’ll need to set up first:
Delta Sharing: The Backbone of Data Sharing (GA)
Delta Sharing is the core feature that powers data sharing in Databricks, enabling secure data sharing across teams, organizations, and platforms. It’s built on open standards like Apache Parquet and Delta Lake, which means it can be used to share data across cloud environments. Delta Sharing is the foundation for features like Clean Rooms and Marketplace, but it can also be used natively, through the Delta Sharing tab in the catalog, for an easier data-sharing setup.
How It Works: Delta Sharing makes it possible to access data stored in Databricks without needing to replicate it. You can use it across multiple cloud environments (AWS, Azure, Google Cloud), and it’s enabled at the metastore level. This makes it ideal for cross-cloud, cross-organization sharing without duplicating data.
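To make the "no replication" point concrete, here is a minimal sketch of the recipient side using the open-source delta-sharing Python connector. It assumes the open sharing model, where the provider sends the recipient a credential (profile) file; the file name config.share and the share, schema, and table names are hypothetical placeholders, not values from a real setup.

```python
from pathlib import Path


def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' address the connector expects."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path: str, share: str, schema: str, table: str):
    """Read a shared table into a pandas DataFrame straight from the provider."""
    # Imported lazily so table_url() works even without the connector installed.
    import delta_sharing  # pip install delta-sharing

    return delta_sharing.load_as_pandas(table_url(profile_path, share, schema, table))


if __name__ == "__main__":
    # Hypothetical profile file and table names, for illustration only.
    profile = "config.share"
    print(table_url(profile, "sales_share", "retail", "orders"))
    if Path(profile).exists():
        print(load_shared_table(profile, "sales_share", "retail", "orders").head())
```

The recipient only needs the profile file and the table address; the data itself stays in the provider’s storage and is read on demand.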
Using Delta Sharing Natively in the Unity Catalog Delta-Sharing UI
Once Delta Sharing is enabled in your Databricks metastore, you can use the Delta Sharing tab in the Unity Catalog UI for managing and sharing data with external partners and internal teams. This feature provides a centralized, secure way to control who can access your data and ensures compliance with governance policies.
How It Works: In the Unity Catalog Delta-Sharing UI, you can define and configure shares, select datasets, and assign permissions. This centralized interface gives you fine-grained control over data access. You can easily share data with other organizations or teams, and external parties can access the data securely without needing to replicate it.
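The same steps the UI walks you through can also be expressed as SQL. The sketch below only builds the statements as strings, so it runs anywhere; the share, table, and recipient names are hypothetical, and in a Databricks notebook you would pass each statement to spark.sql.

```python
def share_statements(share: str, tables: list[str], recipient: str) -> list[str]:
    """Return the SQL a provider could run to publish tables to one recipient."""
    statements = [f"CREATE SHARE IF NOT EXISTS {share}"]
    # Add each table to the share; the data itself is not copied.
    statements += [f"ALTER SHARE {share} ADD TABLE {table}" for table in tables]
    statements.append(f"CREATE RECIPIENT IF NOT EXISTS {recipient}")
    # Recipients get read-only access to everything in the share.
    statements.append(f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}")
    return statements


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for stmt in share_statements("sales_share", ["main.retail.orders"], "partner_co"):
        print(stmt)  # in a Databricks notebook: spark.sql(stmt)
```

Keeping the statements in a small function like this makes it easy to review exactly what will be shared, and with whom, before anything is executed.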
When to Use It: Use this feature when you want a simple, intuitive way to share datasets across your organization or with external partners. The Delta Sharing UI allows for easy setup, management, and monitoring of shared data, making it an ideal choice when you want control over what data is shared and who can access it.
Marketplace: Deliver and Discover Data Products (GA)
The Databricks Marketplace is another exciting feature that enables organizations to list, share, and monetize their data products. It integrates seamlessly with Delta Sharing, making it easy to distribute data without duplicating it.
How It Works: Organizations can list their data products in the Marketplace, making them available to customers or collaborators. It works seamlessly with Delta Sharing, allowing for secure access without having to replicate the data. You can list products publicly or privately, giving you control over who can see and access your data.
Private vs. Public Listings: In the Marketplace, data products can be listed as private or public. Public listings are available to everyone, meaning that any user who accesses the marketplace can view and potentially access the data. Private listings, on the other hand, are more controlled—these are only accessible to users or organizations you explicitly grant access to, providing you with tighter control over your data.
When to Use It: Use the Marketplace when you want to share data products, either internally or with external customers/partners. It’s perfect if you’re looking to monetize your data, share research, or offer datasets and models as services. Learn more about Databricks Marketplace
Clean Rooms: Secure Collaboration Across Organizations (Public Preview)
A new personal favorite is on the horizon! Clean Rooms is a powerful feature in Databricks that enables secure collaboration across organizations. Even in public preview, this tool is already impressing people by allowing multiple parties to share insights without exposing sensitive data. It’s a game changer for securely collaborating with external partners or clients at different levels of data granularity. One party can also request permission to run notebooks on the other party’s data.
How It Works: In Clean Rooms, each participant can bring their own data, but initially, they can only see their own data and the schema of the other party’s data. This ensures that sensitive data stays protected. Participants can run code and queries on their own data, and also specify what type of queries the other party can run.
When to Use It: Use Clean Rooms when you need to collaborate on sensitive data analysis, want each party’s proprietary data to remain protected, and need fine-grained control over the level of detail the other party can see.
Lakehouse Federation: Query Data Across Multiple Cloud Environments
Lakehouse Federation allows you to query data across multiple cloud environments (AWS, Azure, Google Cloud) without needing to replicate or move the data.
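How It Works: Lakehouse Federation lets you query data where it already lives, without moving or transforming it first. As I understand it, you register a connection to the external source in Unity Catalog and expose it as a foreign catalog, which you can then query like any other catalog. Note that this targets external databases and warehouses (for example PostgreSQL, Snowflake, or BigQuery) rather than raw files in cloud object storage. It simplifies cross-platform analytics by abstracting the data access layer.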
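As a rough sketch of what the setup can look like, federation is configured in Unity Catalog with a connection plus a foreign catalog. The example below builds the two SQL statements for a PostgreSQL source; the connection, catalog, host, and database names are hypothetical, as are the secret scope federation and its keys, which stand in for credentials you would store yourself.

```python
def federation_statements(connection: str, catalog: str, host: str, database: str) -> list[str]:
    """Return SQL that registers an external PostgreSQL database as a foreign catalog."""
    create_connection = (
        f"CREATE CONNECTION IF NOT EXISTS {connection} TYPE postgresql "
        f"OPTIONS (host '{host}', port '5432', "
        # Credentials come from a (hypothetical) secret scope, never from plain text.
        "user secret('federation', 'pg_user'), "
        "password secret('federation', 'pg_password'))"
    )
    create_catalog = (
        f"CREATE FOREIGN CATALOG IF NOT EXISTS {catalog} "
        f"USING CONNECTION {connection} OPTIONS (database '{database}')"
    )
    return [create_connection, create_catalog]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for stmt in federation_statements("pg_conn", "partner_pg", "db.example.com", "sales"):
        print(stmt)  # in a Databricks notebook: spark.sql(stmt)
```

Once the foreign catalog exists, its tables can be queried with ordinary three-level names (catalog.schema.table) next to your own data.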
When to Use It: If you have data spread across multiple cloud environments and need to query it together, Lakehouse Federation is a perfect fit. It’s ideal for organizations with a multi-cloud strategy or anyone looking to simplify cross-cloud analytics. It’s also useful when doing initial analysis of third-party data before setting up more permanent solutions. Read More on Lakehouse Federation
REST API: Automate Data Sharing and Management (GA)
The REST API lets you interact with Databricks’ data-sharing features programmatically, which means you can automate tasks, manage permissions, and integrate Databricks’ capabilities into your custom workflows.
How It Works: The API provides a way to manage data shares, datasets, and permissions in Delta Sharing, Clean Rooms, and other Databricks features. It allows you to automate your workflows and integrate data sharing into your enterprise applications.
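For a feel of what automation looks like, here is a minimal sketch that lists the shares in a metastore using only the Python standard library. It assumes the Unity Catalog shares endpoint (/api/2.1/unity-catalog/shares) and a personal access token; the workspace URL and token shown are placeholders you would replace with your own.

```python
import json
import urllib.request


def list_shares_request(host: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the request that lists Delta Sharing shares."""
    url = f"{host.rstrip('/')}/api/2.1/unity-catalog/shares"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}, method="GET"
    )


def list_share_names(host: str, token: str) -> list[str]:
    """Send the request and return the share names from the JSON response."""
    with urllib.request.urlopen(list_shares_request(host, token)) as response:
        payload = json.load(response)
    return [share["name"] for share in payload.get("shares", [])]


if __name__ == "__main__":
    # Placeholder workspace URL and token, for illustration only.
    request = list_shares_request("https://example.cloud.databricks.com/", "<token>")
    print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the sketch easy to test and makes it simple to swap in another HTTP client or the official SDK later.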
When to Use It: The REST API is ideal if you need to automate data-sharing tasks or integrate them into custom applications or workflows. It’s often used by teams that build custom applications or need high-volume integrations. Read More on REST API
Summary: Why Choose Databricks for Data Sharing
Databricks offers a wide range of features for sharing data securely and efficiently. Whether it’s Delta Sharing, Clean Rooms, Marketplace, Lakehouse Federation, or the REST API, Databricks has everything you need for data-sharing across teams, organizations, and cloud platforms.
That said, as your organization starts using these features, it’s important to make sure your data-sharing practices are in line with your company’s rules and best practices. I also strongly recommend getting to know the different features so you can make the best choice for each of your use cases.
Have you used any of these features in your organization? Feel free to drop your thoughts below or reach out for a further discussion!
#DatabricksMVP #DeltaSharing #CleanRooms #Marketplace #LakehouseFederation #Databricks