Databricks Lakehouse Platform. Planes
Aleksei Zhukov
Cloud Data Platforms | Microsoft Fabric, Azure and Databricks | Data for AI | Data Privacy | Data FinOps
It's usually normal to start with something basic and fundamental, but the topics suggested in the Databricks Exam Guide are so boring, generic, and basic that I decided to skip them. At least for now. I'll leave this as a prerequisite for the upcoming content:
Databricks Lakehouse Platform
● Describe the relationship between the data lakehouse and the data warehouse.
● Identify the improvement in data quality in the data lakehouse over the data lake.
● Compare and contrast silver and gold tables, which workloads will use a bronze table as a source, which workloads will use a gold table as a source.
Control vs Compute plane
At a high level, the Databricks platform architecture is divided into two main components: the control plane and the compute plane (a.k.a. the data plane).
The term "plane", well established in IT and telecommunications, conceptually separates the functional aspects of an architecture. In Databricks, it delineates two distinct responsibilities: managing the environment (control plane) and processing the data (compute plane). The two planes run side by side, each focused on its own part of the system, and they operate independently of one another.
Control Plane: Managed by Databricks, the control plane includes the services that manage and orchestrate the Databricks environment: the user interface, API services, job scheduling, cluster management, and data security features. Customer data is processed in the compute plane rather than in the control plane, which supports data privacy and security.
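To make "API services" a bit more concrete, here is a minimal sketch of how a client might address the control plane: it builds (but does not send) a request to the Clusters API list endpoint, `/api/2.0/clusters/list`. The workspace URL, the token value, and the helper name `build_list_clusters_request` are placeholders made up for illustration, not values from a real workspace.

```python
from urllib.request import Request


def build_list_clusters_request(workspace_url: str, token: str) -> Request:
    """Build a GET request for the Clusters API list endpoint.

    The request targets the workspace URL, which is served by the control
    plane; the clusters it describes run in the compute plane.
    """
    base = workspace_url.rstrip("/")  # tolerate a trailing slash
    return Request(
        url=f"{base}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Hypothetical workspace URL and token, for illustration only.
req = build_list_clusters_request(
    "https://example-workspace.cloud.databricks.com/", "dapi-EXAMPLE"
)
print(req.get_method(), req.full_url)
```

Nothing is sent over the network here. Passing `req` to `urllib.request.urlopen` with a real workspace URL and token would perform the call.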
Compute Plane: In fact, there are two compute planes:
The Classic Compute (Data) Plane resides in the customer's cloud account (e.g., AWS, Azure, Google Cloud) and is where classic data processing runs. It includes the compute resources (e.g., clusters), the data itself, and the networking infrastructure. The customer retains control over this plane, enabling customization and integration with existing cloud resources.
Within the same cloud region as the classic plane, Databricks also operates a Serverless Compute Plane. It plays the same role as the classic compute plane, but Databricks runs and manages it rather than the customer. The primary distinction is that in a serverless setup, Databricks abstracts away the underlying infrastructure management, offering simplicity and reduced operational overhead for the user.
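The split described above can be summarized in a small data structure. This is only an illustrative sketch: the `Plane` class, its field names, and the helper functions are made up for this post and are not part of any Databricks SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plane:
    name: str
    infrastructure_managed_by: str  # who looks after the underlying infrastructure
    runs_workloads: bool            # True if data processing happens in this plane


# The three planes as described in the text above.
PLANES = [
    Plane("control plane", infrastructure_managed_by="Databricks", runs_workloads=False),
    Plane("classic compute plane", infrastructure_managed_by="customer", runs_workloads=True),
    Plane("serverless compute plane", infrastructure_managed_by="Databricks", runs_workloads=True),
]


def compute_planes(planes):
    """Return the names of the planes where data processing happens."""
    return [p.name for p in planes if p.runs_workloads]


def managed_by(planes, who):
    """Return the names of the planes whose infrastructure `who` manages."""
    return [p.name for p in planes if p.infrastructure_managed_by == who]


print(compute_planes(PLANES))
print(managed_by(PLANES, "Databricks"))
```

Read this way, the serverless plane is the odd one out: it processes data like the classic plane, but its infrastructure is managed by Databricks like the control plane.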
Network Connectivity Configuration (NCC) is crucial for ensuring secure and compliant data access and communication within the serverless environment.
Databricks manages the NCC for serverless compute, simplifying network security by automatically configuring and enforcing security rules that isolate the compute resources and control their network access.
For classic compute resources, a default virtual network (a VNet in Azure, a VPC in AWS or GCP) is created automatically in your cloud account when you deploy Databricks. This virtual network houses all the classic compute resources, so they operate within a secure and isolated network environment.
Returning to the term "plane": the control plane and the compute plane are isolated from one another to protect sensitive data (which resides in the compute plane) while still allowing the control plane to manage resources securely.
Remember, a Data samurai has no destination, only a path.