Cluster configuration best practices on Databricks
Santos Saenz Ferrero
B.Comp.Sc. Cloud Data & Platform Engineer at Axpo Group
In this article I will explain the different compute options available in Databricks and which one to choose depending on your needs. I will also cover best practices for managing compute and efficient configuration options for each compute type depending on your workloads. Note that this article was written in November 2023, and the Databricks Lakehouse Platform changes significantly over time. Hope you enjoy it :)
An Azure Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad-hoc analytics and machine learning. Databricks offers the following compute types: All-Purpose Compute, Job Compute, SQL Warehouses and Pools.
This article does not cover how to deploy these compute resources. Compute on Databricks can be deployed through the UI, the Databricks CLI, and the Databricks REST API [1]. It is recommended to use Terraform or another IaC tool for deploying, versioning and changing infrastructure at scale across multiple clouds; these IaC tools interact with the cloud providers through the REST API. The documentation for the clusters REST API is available at Clusters API | REST API reference | Azure Databricks.
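As a reference point, the sketch below shows what a direct call to the Clusters REST API could look like from Python. The workspace URL, token environment variables and cluster values are placeholders, and the field names follow the Clusters API 2.0 reference as of the time of writing, so treat it as a minimal sketch rather than a production deployment script.

```python
import os
import requests

# Placeholders: point these at your own workspace and token.
HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-1234567890123456.7.azuredatabricks.net
TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token or AAD token

def create_cluster(spec: dict) -> str:
    """Create a cluster through the Clusters REST API and return its cluster_id."""
    resp = requests.post(
        f"{HOST}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=spec,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["cluster_id"]

# Minimal all-purpose cluster spec; runtime and VM size are examples only.
spec = {
    "cluster_name": "demo-interactive",
    "spark_version": "13.3.x-scala2.12",   # an LTS runtime, for illustration
    "node_type_id": "Standard_DS3_v2",     # Azure general-purpose VM, for illustration
    "num_workers": 2,
    "autotermination_minutes": 90,
}

if __name__ == "__main__":
    print(create_cluster(spec))
```

The same payload shapes are what Terraform providers ultimately send, so the snippets in the rest of this article reuse this dictionary style.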
1. Job Cluster
Job Clusters in Databricks are transient clusters created specifically for running jobs. They are the preferred option for non-interactive workloads because of their cost efficiency: the compute is only used for the duration of the job, after which it is terminated [1]. Databricks Units (DBUs) for All-Purpose Compute are charged at a higher rate than for Job Compute [2]. With a Job Cluster, every job runs in a new cluster, giving it an isolated environment where the cluster's resources and configuration are dedicated to that specific job and reducing the risk of resource contention with other workloads. Job Clusters let you configure cluster settings, such as the number and type of worker and driver nodes, auto-scaling, libraries and environment settings, for a specific job. This fine-grained control can be advantageous for tasks with specific requirements; these cluster settings are explained later. Job Clusters can be integrated with the job scheduler to run recurring scheduled jobs, and they are also suitable for one-off data processing jobs.
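As an illustration, a job with its own transient cluster can be defined through the Jobs API 2.1. The notebook path, cron expression and node sizes below are placeholders, and the payload shape follows the API reference as I understand it; it is a sketch, not a drop-in definition.

```python
# Job definition with a transient job cluster (Jobs API 2.1 payload shape).
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},  # illustrative path
            "new_cluster": {                      # created for this run, terminated afterwards
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 4,
            },
        }
    ],
    "schedule": {                                 # integration with the job scheduler
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "UTC",
    },
}
# POST job_spec to {HOST}/api/2.1/jobs/create with the same auth header as above.
```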
Nevertheless, for interactive execution and development, Job Clusters are not a good fit because of the time the cluster needs to start before it can run the workload; that start-up time matters, since the developer is waiting for it [3].
2. All Purpose Cluster
All Purpose Clusters are mainly used to analyze data interactively using Databricks notebooks. Multiple users can share these clusters for collaborative interactive analysis, and they are a better fit for this purpose than Job Clusters because the cluster is not terminated immediately after a workload finishes; it is terminated only after the period of inactivity specified in the cluster configuration [3]. As long as users keep working on the cluster and the configured inactivity timeout is not reached, there are no start-up delays as in the previous case, which saves development time for the engineers. Furthermore, All Purpose Clusters offer the same configuration options as Job Clusters, giving the same fine-grained control, which can be advantageous for tasks with specific requirements. All-Purpose Compute clusters were designed to be versatile and handle various tasks within the same cluster, and this flexibility makes it easier to allocate resources as needed for different workloads.
3. SQL Warehouse
Another cluster type is the SQL Warehouse, a general-purpose compute for SQL queries, visualizations, and dashboards that are executed against tables in the Lakehouse. Within Databricks SQL, these queries, visualizations, and dashboards are developed and executed using the SQL editor [4]. For interactive SQL workloads, a Databricks SQL warehouse is the most cost-efficient engine. Configuration options are explained later.
4. Pools
Azure Databricks pools are sets of idle, ready-to-use instances [3]. When cluster nodes are created using the idle instances, cluster start and auto-scaling times are reduced. If the pool has no idle instances, the pool expands by allocating a new instance from the instance provider in order to accommodate the cluster request. When a cluster releases an instance, it returns to the pool and is free for another cluster to use. Only clusters attached to a pool can use that pool's idle instances. Cluster pools should be used when a start time significantly shorter than a full cluster start is needed, as Databricks pools reduce cluster start and auto-scaling times at the cost of keeping ready-to-use instances [5]. Configuration options are explained later.
5. Compute Policies
Compute Policies are a workspace feature that enables workspace admins to limit the compute creation permissions of users or groups based on a set of policies. Compute policies require at least the Premium plan in the workspace [6]. Policies provide capabilities for limiting users to creating clusters with prescribed settings, limiting users to creating a certain number of clusters, simplifying the UI for workspace users when creating clusters by setting or hiding some options, and enforcing cluster-scoped library installation [7].
When creating Compute Policies, Policy Families can be used. Policy Families are policy templates with pre-populated rules designed to address common compute use cases. The rules of your policy are inherited from the policy family and can be overridden or extended with your own rules. If you previously added compute-scoped libraries using init scripts, it is recommended to use compute policies instead of init scripts to install libraries. By default, non-admin users of the workspace must be granted permissions on a policy to have access to it, unless they have unrestricted cluster creation permissions, which allow them to create fully configurable compute resources. You can monitor a policy's adoption by viewing the compute resources that use it: from the Policies page, click the policy you want to view, then click the Compute or Jobs tabs to see a list of resources that use the policy. As the cluster policy definition language is a long and extensive topic in its own right, it is not covered in this article.
Cluster policies can be created using the UI, the CLI or the REST API. The cluster policies REST API is documented at Cluster Policies API | REST API reference | Azure Databricks. As with clusters, it is recommended to manage policies with Terraform or another IaC tool, as they provide capabilities for deploying, versioning and changing infrastructure at scale across multiple clouds.
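Although the policy definition language itself is out of scope here, the sketch below gives a feel for what a minimal policy payload could look like when created through the REST API. The rule types and field names follow the cluster policies documentation as I recall it, and all values are illustrative.

```python
import json

# A small policy that pins an LTS runtime, restricts node types and caps
# auto-termination. Rule types ("fixed", "allowlist", "range") come from the
# cluster policy definition language; the concrete values are examples only.
policy_definition = {
    "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12"},
    "node_type_id": {"type": "allowlist", "values": ["Standard_DS3_v2", "Standard_DS4_v2"]},
    "autotermination_minutes": {"type": "range", "maxValue": 120, "defaultValue": 60},
}

policy_payload = {
    "name": "team-standard-clusters",
    "definition": json.dumps(policy_definition),  # the API expects the rules as a JSON string
}
# POST policy_payload to {HOST}/api/2.0/policies/clusters/create.
```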
6. Job Cluster and All Purpose Cluster configuration options
When creating an All Purpose Cluster or a Job Cluster, the following options need to be configured: the compute policy, access mode, performance options, driver and worker nodes, tags and a few other advanced options.
6.1 Cluster Access Modes
Cluster access mode is a security feature that determines who can use a cluster and what data they can access within it, along with some capabilities and limitations for each mode. When choosing an access mode, consider the security, efficiency and flexibility of each mode and pick the one that best fits your needs. There are currently three modes available: single user, shared and no isolation shared.
Single user mode supports Unity Catalog, the unified governance solution provided by Databricks. The cluster can be assigned to and used by only a single user. It supports Scala, Python, SQL and R, as well as the RDD API, DBFS mounts, init scripts and external libraries. Databricks Runtime ML is also supported.
Nonetheless, single user mode does not support legacy table ACLs. To read from a view, you must have SELECT permissions on all referenced tables and views. Dynamic views are not supported. It cannot be used to access a table that has a row filter or column mask. When used with credential passthrough, Unity Catalog features are disabled. It cannot be used to query tables or views created by Unity Catalog-enabled Delta Live Tables pipelines.
Single user mode should be used for workloads where a user's environment needs to be isolated from other users. For example, you might use single user mode for a cluster that is only used by a single user, that is created for an automated job by a scheduler, or that is executing a one-time long-running job.
Shared mode can be used by multiple users with data isolation between them; it requires a Premium plan. It supports Unity Catalog, as well as SQL, Python with Databricks Runtime 11.1 and above, and Scala on Unity Catalog-enabled clusters with Databricks Runtime 13.3 and above. Unlike single user mode, it supports dynamic views and legacy table ACLs. Init scripts and external libraries are only supported on Databricks Runtime 13.3 LTS with Unity Catalog-enabled clusters. Nonetheless, it does not support the RDD API or DBFS mounts. When used with credential passthrough, Unity Catalog features are disabled. Databricks Runtime ML is not supported, and neither are custom containers or spark-submit jobs. Due to user isolation, Scala code cannot access the Spark driver JVM's internal state or system files. Additionally, the SparkContext and SQLContext classes and their methods are not available. Hive and Scala UDFs cannot be used, and Python UDFs are only supported in Databricks Runtime 13.2 and above.
The shared access mode should be used for most workloads. It is the most flexible and efficient mode to use. Use other modes if you need specific support for some features that are not available in shared access mode.
No isolation shared mode can be used by multiple users with no data isolation between them. It does not support Unity Catalog. It can be used with SQL, Scala, Python and R, and it supports the RDD API, DBFS mounts, init scripts, libraries and Databricks Runtime ML. Nonetheless, it does not support dynamic views, legacy table ACLs or credential passthrough.
No isolation shared mode should be used if you need a shared environment, and you need a feature not available in shared mode, like RDD API, DBFS mounts, init scripts, libraries and Databricks Runtime ML.
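To make the choice concrete, the fragments below show how an access mode might be expressed in a cluster spec. The data_security_mode field and its values are taken from the Clusters API documentation as of the time of writing, and the cluster names and identity are placeholders; treat it as a sketch.

```python
# Single user access mode: the cluster is assigned to exactly one identity.
single_user_cluster = {
    "cluster_name": "ml-dev-alice",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "data_security_mode": "SINGLE_USER",
    "single_user_name": "alice@example.com",   # placeholder identity
}

# Shared access mode (user isolation); "NONE" would correspond to no isolation shared.
shared_cluster = {
    "cluster_name": "analytics-shared",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,
    "data_security_mode": "USER_ISOLATION",
}
```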
6.2 Cluster Performance Options
For the performance options of the cluster, you need to configure the Databricks Runtime version and whether Photon acceleration should be used.
Databricks Runtime is an optimized and pre-configured runtime environment for running data engineering and data science workloads on Apache Spark. It contains Delta Lake, Java, Scala, Python, and R libraries. It also contains Ubuntu and its accompanying system libraries. It contains GPU libraries for GPU-enabled clusters. It is composed of Azure Databricks services that integrate with other components of the platform, such as notebooks, jobs, and cluster management [1].
For an All Purpose Cluster it is recommended to use the latest Databricks Runtime version. This will ensure you get the latest packages and versions with the most up to date SDK and features. For job clusters, as they are usually used for running production workloads, the Long Term Support Databricks Version is recommended as it will ensure your production workloads don’t run into incompatibility issues, by having a version that is going to be continuously patched and fixed. For advanced machine learning use cases, consider the specialized Databricks Runtime Machine Learning version.
Photon is a high-performance, Azure Databricks-native vectorized query engine that runs your SQL workloads and DataFrame API calls faster to reduce your total cost per workload. It should be used for both All Purpose Clusters and Job Clusters, as the cost of a DBU is the same with Photon enabled or disabled. More details about Photon can be found at What is Photon? - Azure Databricks | Microsoft Learn.
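In a cluster spec these two choices might look like the fragments below. The runtime_engine field is how the Clusters API exposes Photon as far as I know, and the version strings are examples only; check the runtime release notes for what is current and what is LTS.

```python
# Runtime and Photon selection, sketched as cluster spec fragments.
job_cluster_perf = {
    "spark_version": "13.3.x-scala2.12",   # an LTS runtime for production jobs
    "runtime_engine": "PHOTON",            # enable the Photon engine
}

interactive_cluster_perf = {
    "spark_version": "14.1.x-scala2.12",   # a recent (non-LTS) runtime for development
    "runtime_engine": "PHOTON",
}
```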
6.3 Driver and worker node instance types
A cluster consists of one driver node and zero or more worker nodes. You can use separate cloud provider instance types for the driver and worker nodes. There are different families of instance types: General Purpose, Memory Optimized, Storage Optimized, Compute Optimized, GPU Optimized and Disk Cache Accelerated. Each of these families facilitates particular use cases. After choosing an instance type, it is recommended to check the metrics of your workloads to validate that the chosen instance is the most efficient one for them.
The General Purpose instance family is the most well-rounded and balanced option, offering a good combination of compute, memory and networking resources. Instances of this family are versatile and can be used for a wide variety of workloads. They are recommended for a shared All Purpose Cluster used for development and interactive workloads, as these workloads have different needs depending on the feature each engineer on the data team is working on. Furthermore, General Purpose instance types are recommended for workflow jobs that perform both wide and narrow transformations, since compute and memory are equally important for these workloads and a balanced compute is therefore needed.
Memory-optimized instance types in Databricks are compute resources specifically designed to excel at memory-intensive workloads; they provide clusters with high memory-to-CPU ratios. These instance types are well suited for data analysis workloads that manipulate and analyze large datasets and perform complex aggregations, as more data can fit in memory and less data is shuffled. They are also useful for data engineering workloads whose data requires a lot of space to fit in memory and that have wide transformations with the consequent shuffles. Furthermore, these instance types can be useful for machine learning workloads that train models with iterative, memory-intensive algorithms.
Storage-optimized instance types are ideal for Delta use cases, as they are custom built to provide better caching and performance when querying Delta tables. They are well suited for the same use cases as memory-optimized instance types, but specifically when using data from Delta tables.
Compute-optimized instance types in Databricks are a category of compute resources designed to excel at CPU-intensive workloads. These instances offer a higher ratio of CPU power to memory, making them ideal for tasks that require substantial computational capacity and processing speed. Compute-optimized instances are valuable when you need to process data quickly and efficiently, especially for tasks involving complex calculations and narrow data transformations that do not perform shuffles and that can run in parallel.
GPU-optimized instance types in Databricks are specialized compute resources that are designed to take advantage of Graphics Processing Units (GPUs). GPUs are highly parallel processors that excel in tasks that can be broken down into many smaller, parallel computations. These GPU-optimized instances are well-suited for workloads that require intensive parallel processing, such as machine learning, deep learning, and other computationally demanding tasks [8].
General Purpose, Memory Optimized and Storage Optimized instance types are also available in Disk Cache Accelerated (a.k.a. Delta Cache Accelerated) variants. These clusters automatically use the disk caching feature of Databricks, which creates local copies of the remote Parquet files used by the workloads. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. This feature differs from the Apache Spark cache. These instance types have a bit more memory than the standard ones in order to better leverage disk caching, so their price is slightly higher. Nevertheless, they should be the chosen instance type for workloads that read repeatedly from the same remote files, or that read both the same remote and local files repeatedly. Data analysis workloads are a typical workload type that benefits from these instance types, as they usually read the same files repeatedly. Machine learning workloads are another typical use case, as they usually read the same remote files repeatedly while training models. For automated jobs this instance type should probably not be chosen, as these workflows usually read the remote source files once and the cluster is terminated after the job finishes [9].
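To make the families concrete, here are some illustrative Azure VM choices plus a cluster fragment that makes the disk-cache setting explicit. The VM names are examples only (check availability and pricing in your region), and spark.databricks.io.cache.enabled is the Spark setting documented for the disk cache.

```python
# Illustrative Azure VM choices per instance family; not a recommendation list.
node_type_examples = {
    "general_purpose": "Standard_DS4_v2",
    "memory_optimized": "Standard_E8_v3",
    "storage_optimized_delta_cache": "Standard_L8s_v2",
    "compute_optimized": "Standard_F8s",
    "gpu": "Standard_NC6s_v3",
}

# A data-analysis cluster on a cache-accelerated instance type. Disk caching is
# enabled by default on these types; setting it explicitly documents the intent.
analysis_cluster = {
    "cluster_name": "bi-exploration",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": node_type_examples["storage_optimized_delta_cache"],
    "num_workers": 4,
    "spark_conf": {
        "spark.databricks.io.cache.enabled": "true",
    },
}
```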
6.4 Spot Instances
Worker nodes can take advantage of spot instances. These let you use spare capacity in the cloud at a lower price, but if cloud demand increases your nodes may be deallocated. This is very useful for non-critical workloads where you can afford to lose some of the work done. To plan the maximum price you are willing to pay for your spot instances, you can check the availability rate of spot instances at a given maximum price from the virtual machine page in Azure. Worker nodes should be deployed as spot instances whenever possible to lower costs [10].
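In a cluster spec, the spot configuration might look like the fragment below. The azure_attributes field names follow the Clusters API as I understand it, and the values are examples; keeping the driver on on-demand capacity is a common pattern.

```python
# Running workers on Azure spot instances.
spot_worker_settings = {
    "azure_attributes": {
        "first_on_demand": 1,                       # keep the driver on on-demand capacity
        "availability": "SPOT_WITH_FALLBACK_AZURE", # fall back to on-demand if evicted
        "spot_bid_max_price": -1,                   # -1 means pay up to the on-demand price
    }
}
```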
6.5 Autoscaling
Autoscaling is a feature that allows your Databricks cluster to automatically adjust its number of worker nodes based on the current workload demand, within the minimum and maximum number of worker nodes selected during cluster creation. The primary goal of autoscaling is to optimize resource utilization, reduce costs, and ensure that your cluster can handle varying workloads efficiently. Standard plan workspaces use standard autoscaling, while premium and enterprise plan workspaces use optimized autoscaling. The difference in behaviour between the two types of autoscaling is explained in Create a cluster - Azure Databricks | Microsoft Learn. Autoscaling should be used for varying workloads with careful planning, checking the cluster metrics to fine-tune the minimum and maximum number of worker nodes.
It can often be difficult to estimate how much disk space a particular job will take. Nevertheless Azure Databricks automatically enables autoscaling local storage on all Azure Databricks clusters.
For workloads that train machine learning models, autoscaling should be avoided, since these jobs will often consume all available nodes, in which case autoscaling provides no benefit. Furthermore, machine learning workloads usually access remote files repeatedly, so it is good practice to keep this data in the disk cache, but when the cluster scales down the cached data is lost.
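A cluster spec expresses this with an autoscale block, or a fixed num_workers when autoscaling is not appropriate. The numbers below are placeholders to be tuned against your cluster metrics.

```python
# Autoscaling between a floor and a ceiling of workers.
autoscaling_settings = {
    "autoscale": {
        "min_workers": 2,
        "max_workers": 8,
    }
}

# For ML training that saturates the cluster anyway, prefer a fixed size instead.
fixed_size_settings = {"num_workers": 8}
```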
6.6 Automatic termination
Auto termination is a feature that terminates unused clusters after a period of inactivity. For All Purpose Clusters, auto termination should be set to a fairly high number of minutes, such as 90, because engineers use these clusters for interactive development and any time spent waiting for a terminated cluster to start up again is time they cannot spend developing. For Job Clusters, auto termination is not needed, since the cluster is terminated when the scheduled job finishes.
6.7 Local Disk Encryption
Some instance types that are used to run clusters may have locally attached disks. Azure Databricks may store shuffle data or ephemeral data on these locally attached disks. To ensure that all data at rest is encrypted for all storage types, including shuffle data that is stored temporarily on your cluster’s local disks, you can enable local disk encryption. Your workloads may run more slowly because of the performance impact of reading and writing encrypted data to and from local volumes. Local Disk Encryption should be considered when workloads using the compute manage sensitive data.
6.8 Tags
You can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to cloud resources such as VMs and disk volumes, as well as to DBU usage reports. For clusters launched from pools, the custom cluster tags are only applied to DBU usage reports and do not propagate to cloud resources. The use of tags makes governance of resources easier, as resources can be filtered, checked and monitored by tag.
6.9 Advanced Options
Further advanced options can be configured by setting Spark configuration, environment variables, logging destination and init scripts.
The Spark configuration is used to fine-tune Spark jobs by providing custom Spark configuration properties. You can also define custom environment variables that can be accessed from init scripts running on the cluster, and specify a location to deliver the logs for the Spark driver node, worker nodes, and events. Logs are delivered every five minutes and archived hourly in your chosen destination. When a cluster is terminated, Azure Databricks guarantees to deliver all logs generated up until the moment of termination.
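Putting the pieces together, a fuller all-purpose cluster spec might look like the sketch below. All values (paths, tags, Spark settings) are illustrative, and the field names follow the Clusters API 2.0 as documented at the time of writing, so verify them before use.

```python
# A fuller all-purpose cluster spec combining the options discussed above.
full_spec = {
    "cluster_name": "analytics-team",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_E8_v3",                       # memory-optimized example
    "autoscale": {"min_workers": 2, "max_workers": 6},
    "autotermination_minutes": 90,
    "enable_local_disk_encryption": True,                    # for sensitive data
    "custom_tags": {"team": "analytics", "cost_center": "1234"},
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},   # example Spark property
    "spark_env_vars": {"ENVIRONMENT": "dev"},                # readable from init scripts
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs/analytics-team"}},
    "init_scripts": [
        {"workspace": {"destination": "/Shared/init/install-monitoring.sh"}}  # hypothetical script
    ],
}
# This dictionary can be passed to the create_cluster helper shown earlier.
```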
7. SQL Warehouse Configuration Options
When deploying a SQL Warehouse there are several options that need to be configured. These include cluster size, auto stop, scaling, the type of SQL Warehouse and some advanced options [11].
Cluster size represents the size of the driver node and the number of worker nodes associated with the cluster; to reduce query latency, increase the size. Databricks SQL Warehouse supports three types: Classic, Pro and Serverless.
The Classic SQL Warehouse type is the original SQL Warehouse type. It offers a limited set of features and performance optimizations.
The Pro SQL Warehouse type offers a more comprehensive set of features and performance optimizations than the Classic SQL Warehouse type. It also includes support for advanced Databricks SQL features, such as Delta Cache.
The Serverless SQL Warehouse type is the most advanced SQL Warehouse type. It offers the best performance and scalability, as well as support for all Databricks SQL features [12].
The best SQL Warehouse type to choose will depend on your specific needs. If you have a limited budget, then the Classic SQL Warehouse type is a good option. If you need more features and performance, then the Pro SQL Warehouse type is a good option. And if you need the best possible performance and scalability, then the Serverless SQL Warehouse type is the best option. To learn about the latest Databricks SQL features for each type of SQL Warehouse, check the Databricks SQL release notes: Databricks SQL release notes - Azure Databricks - Databricks SQL | Microsoft Learn.
Auto stop is a feature that stops the warehouse if it is idle for the specified number of minutes. Idle SQL warehouses continue to accumulate DBU and cloud instance charges until they are stopped. For typical use cases, 45 minutes is recommended for Pro and Classic SQL warehouses, while 10 minutes is recommended for Serverless SQL warehouses.
Scaling sets the minimum and maximum number of clusters that will be used for a query. Databricks recommends a cluster for every 10 concurrent queries.
The advanced options let you configure tags, the use of Unity Catalog and the channel. As with the previous compute options, tags make governance of resources easier, since resources can be filtered, checked and monitored by tag. With respect to the channel, the current channel provides previously tested functionality, while the preview channel lets you test new functionality before it becomes standard in the current channel.
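As a sketch, a Pro SQL warehouse could be created through the SQL Warehouses API with a payload like the one below. Sizes, timeouts and tag values are illustrative, and the field names follow the API reference as I recall it, so verify them against the current documentation.

```python
# Creating a pro SQL warehouse through the SQL Warehouses API.
warehouse_spec = {
    "name": "bi-warehouse",
    "cluster_size": "Medium",
    "warehouse_type": "PRO",
    "enable_serverless_compute": False,
    "auto_stop_mins": 45,            # recommended starting point for pro/classic warehouses
    "min_num_clusters": 1,
    "max_num_clusters": 3,           # roughly one cluster per ~10 concurrent queries
    "tags": {"custom_tags": [{"key": "team", "value": "analytics"}]},
    "channel": {"name": "CHANNEL_NAME_CURRENT"},
}
# POST warehouse_spec to {HOST}/api/2.0/sql/warehouses.
```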
8. Pool Configuration Options
A pool can be configured with a minimum and maximum number of idle instances, an auto termination time, the instance type, the Databricks Runtime version, whether to use Photon, whether to use on-demand or spot instances, and tags.
The minimum and maximum number of idle instances should be fine-tuned by weighing how fast the workload needs to start and auto-scale against the cost of keeping ready-to-use instances.
The available instance types and runtime versions are the ones explained for All Purpose and Job Clusters, and the Photon and spot instance options are the same features explained for those clusters.
As with the previous compute options, tags for pools make governance of resources easier, since resources can be filtered, checked and monitored by tag.
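As a final sketch, a pool could be created through the Instance Pools API with a payload like the one below. The field names follow the API reference as I understand it, and sizes, the preloaded runtime and the tags are examples only.

```python
# Creating an instance pool that clusters can draw their nodes from.
pool_spec = {
    "instance_pool_name": "etl-pool",
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 1,
    "max_capacity": 10,
    "idle_instance_autotermination_minutes": 30,
    "preloaded_spark_versions": ["13.3.x-scala2.12"],     # runtime preloaded on idle nodes
    "azure_attributes": {"availability": "SPOT_AZURE"},   # or ON_DEMAND_AZURE
    "custom_tags": {"team": "data-eng"},
}
# POST pool_spec to {HOST}/api/2.0/instance-pools/create; clusters then reference
# the returned instance_pool_id to use the pool's idle instances.
```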
References
[1] Mssaperla. (2023, October 16). Compute - Azure Databricks. Microsoft Learn. https://learn.microsoft.com/en-us/azure/databricks/clusters/
[2] Databricks pricing | Databricks. (n.d.). Databricks. https://www.databricks.com/product/pricing
[3] Mssaperla. (2023b, October 10). Best practices: Cluster configuration - Azure Databricks. Microsoft Learn. https://learn.microsoft.com/en-us/azure/databricks/clusters/cluster-config-best-practices
[4] Miah, F. (2023, May 12). Introduction to Databricks SQL - Advancing Analytics. Advancing Analytics. https://www.advancinganalytics.co.uk/blog/2023/4/6/introduction-to-databricks-sql
[5] Chris Stevens and David Meyer. (2019, November 11). Speed Up Your Data Pipeline with Databricks Pools. Databricks. https://www.databricks.com/blog/2019/11/11/databricks-pools-speed-up-data-pipelines.html
[6] DataBricks Platform & Add-Ons. (n.d.). Databricks. https://www.databricks.com/product/pricing/platform-addons
[7] Anindita Mahapatra and Stephen Carman. (2023, May 11). Cluster Policy Onboarding primer. Databricks. https://www.databricks.com/blog/cluster-policy-onboarding-primer
[8] Joseph Bradley, Tim Hunter and Yandong Mao. (2016, October 27). GPU Acceleration in Databricks. Databricks. https://www.databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html
[9] Miah, F. (2021, September 27). Databricks Delta Cache and Spark Cache - Advancing Analytics. Advancing Analytics. https://www.advancinganalytics.co.uk/blog/2021/9/20/databricks-delta-cache-and-spark-cache
[10] Clinton Ford, Bruce Nelson and Rebecca Li. (2021, May 25). How to leverage azure spot instances for azure databricks. Databricks. https://www.databricks.com/blog/2021/05/25/leverage-unused-compute-capacity-for-data-ai-with-azure-spot-instances-and-azure-databricks.html
[11] Mssaperla. (2023d, October 12). Configure SQL warehouses - Azure Databricks - Databricks SQL. Microsoft Learn. https://learn.microsoft.com/en-us/azure/databricks/sql/admin/create-sql-warehouse
[12] Mssaperla. (2023j, October 30). What is data warehousing on Azure Databricks? - Azure Databricks - Databricks SQL. Microsoft Learn. https://learn.microsoft.com/en-us/azure/databricks/sql/#--what-are-the-available-warehouse-types