Navigating Databricks developer tools
Developing on Databricks outside of the Databricks environment is challenging, and Databricks provides four main developer tools to help. In this article, we focus on three of them: the Databricks CLI, Databricks Connect, and the VS Code extension. We will not go through every feature of these tools, but we will share some not-so-obvious findings that will hopefully help you in your development and debugging process.
Databricks CLI
1. Installing the CLI. Databricks has very good documentation on how to install it: https://docs.databricks.com/en/dev-tools/cli/install.html
In our experience, the Homebrew option works great on macOS, and winget works well on Windows. Otherwise, you can always install from a source build.
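For reference, a typical installation looks like this (commands taken from the official docs; double-check there if anything has changed):

# macOS (Homebrew)
brew tap databricks/tap
brew install databricks

# Windows (winget)
winget search databricks
winget install Databricks.DatabricksCLI

# Verify the installation
databricks -v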
2. Authentication. Use the Databricks CLI to authenticate to Databricks from your local machine.
We do not recommend personal access tokens (from a security perspective, they are not the best option). Instead, use OAuth user-to-machine (U2M) authentication.
For the workspace-level commands, use the following command to authenticate:
databricks auth login --host <workspace-url>
This creates an authentication profile in the .databrickscfg file that looks like this (not a real host, just an example):
[dbc-1234a567-b8c9]
host = https://dbc-1234a567-b8c9.cloud.databricks.com/
auth_type = databricks-cli
One thing to pay attention to is the evaluation order: environment variables take precedence over the configuration profile. If you happen to have DATABRICKS_* environment variables set, they might interfere with authentication.
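A quick way to check for conflicts is to list Databricks-related environment variables and the profiles the CLI knows about (databricks auth profiles also shows whether each profile is valid):

env | grep -i databricks       # any DATABRICKS_* variables take precedence over the profile
databricks auth profiles       # lists profiles from .databrickscfg
unset DATABRICKS_HOST DATABRICKS_TOKEN   # drop conflicting variables if needed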
Databricks Connect
Let’s try out Databricks Connect first, without using the VS Code extension.
1. Create and start a cluster. We use the smallest Personal Compute cluster with the 15.4 LTS runtime.
2. Create a demo folder. Create a virtual environment in it and install the databricks-connect package. We must use Python 3.11.x and databricks-connect 15.4.x to match the runtime:
mkdir demo
cd demo
uv venv -p 3.11 .venv
source .venv/bin/activate
uv pip install databricks-connect==15.4.4
python
3. Run the code. The last command above drops us into a Python console in the terminal. Let’s try to run some commands:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.profile("dbc-1234a567-b8c9").getOrCreate()
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)
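As a side note, you can inspect what kind of session we actually got; it is a remote Spark Connect session, not a classic local one (the exact class path below may differ between databricks-connect versions):

print(type(spark))
# e.g. <class 'pyspark.sql.connect.session.SparkSession'>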
A couple of things to note here: we never installed pyspark in the virtual environment, yet we can create Spark DataFrames and call pyspark APIs on them.
This is because pyspark is “hidden” and embedded inside databricks-connect (and that is why it is a bad idea to list both databricks-connect and pyspark in your project dependencies).
You can see this by looking into .venv/lib/python3.11/site-packages:
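For example, from the demo folder:

ls .venv/lib/python3.11/site-packages | grep -i pyspark
# pyspark   <- installed by databricks-connect, not by a separate pyspark package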
4. Let’s try to use a plain SparkSession instead!
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)
It should not surprise you that this does not work. The import succeeds (because pyspark is installed), but when we run the second line, we get a RuntimeError: “Only remote Spark sessions using Databricks Connect are supported. Use DatabricksSession.builder to create a remote Spark session instead.”
Interestingly enough, this works with the VS Code extension! Let’s get into it!
VS Code Extension
Now let’s try out Databricks Connect together with the VS Code extension.
1. Install the extension. Follow the official documentation.
2. Open the demo project. It does not contain any files except the .venv folder. Click the Databricks logo -> Create configuration -> choose your profile.
3. Get all green checkmarks: activate your virtual environment, make sure the required packages are installed with pip, and select your cluster.
4. Let’s check the files. The extension created several files, including databricks.yml. The host in that file must match the host specified in the profile.
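For reference, the generated databricks.yml looks roughly like this (a sketch based on our example profile; your bundle name and target layout may differ):

bundle:
  name: demo

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dbc-1234a567-b8c9.cloud.databricks.com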
5. Let’s create a demo.py file with the same code we tried to run from the terminal. We’ll also add “# Databricks notebook source” as the first line, so the file is recognized as a notebook.
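Here is the resulting demo.py (the SparkSession variant that failed in the plain terminal; the # COMMAND ---------- marker splits the file into notebook cells):

# Databricks notebook source
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# COMMAND ----------

df = spark.read.table("samples.nyctaxi.trips")
df.show(5)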
This works because the extension created a configuration in the .vscode folder (settings.json):
{
"jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])",
"jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------"
}
The code can also be run in VS Code, and the Spark session is created successfully, because the extension generated all the necessary environment variables and stored them in the .databricks.env file in the .databricks folder.
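We cannot reproduce the file verbatim here, but conceptually it contains entries along these lines (the variable names below are illustrative; inspect your own .databricks/.databricks.env to see the real ones):

DATABRICKS_HOST=https://dbc-1234a567-b8c9.cloud.databricks.com
DATABRICKS_CLUSTER_ID=<your-cluster-id>
DATABRICKS_AUTH_TYPE=databricks-cli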
Unfortunately, the databricks-connect code is proprietary, so we cannot tell exactly how the environment variables get evaluated.
The only thing we know is that there is no magic :-)
Conclusions
We hope this article helps clear up some of the most common points of confusion when developers try to develop outside of a Databricks environment.
Check out a related article about developing on Databricks: https://medium.com/marvelous-mlops/developing-on-databricks-without-compromises-10d623d6301a