Navigating Databricks developer tools

Developing for Databricks outside of the Databricks environment can be challenging. Databricks provides 4 main developer tools to help:

  • Databricks CLI: a command line interface that allows you to interact with the Databricks platform. It is a very powerful tool with a wide range of commands (essentially, all the functionality available via the API or Terraform is also available via the CLI).
  • Databricks Asset Bundles: a developer tool that simplifies the deployment of various assets to Databricks. The databricks bundle commands are part of the CLI. Check out a related article.
  • Databricks Connect: a Python package (there is also support for Scala and R) that lets you trigger execution of Spark code on a Databricks cluster from your local environment.
  • Databricks VS Code extension: an extension that integrates with the 3 other tools (the Databricks CLI, Databricks Connect, and Databricks Asset Bundles). It makes it very easy to connect to a Databricks workspace and execute your code on a Databricks cluster.

In this article, we focus on the Databricks CLI, Databricks Connect, and the VS Code extension. We will not go through every feature of these tools, but we will share some not-so-obvious findings that will hopefully help you in your development and debugging process.

Databricks CLI

1. Installing the CLI. Databricks has very good documentation on how to install it: https://docs.databricks.com/en/dev-tools/cli/install.html

In our experience, the Homebrew option works great on macOS, and winget works well on Windows. Otherwise, you can always install from a source build.

2. Authentication. Use the Databricks CLI to authenticate to Databricks from your local machine.

We do not recommend using personal access tokens (from a security perspective, this is not the best option). Instead, use user-to-machine authentication.

For the workspace-level commands, use the following command to authenticate:

databricks auth login --host <workspace-url>        

This creates an authentication profile in the .databrickscfg file that looks like this (this is not a real host, just an example):

[dbc-1234a567-b8c9]
host      = https://dbc-1234a567-b8c9.cloud.databricks.com/
auth_type = databricks-cli        
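
If you authenticate to several workspaces, .databrickscfg simply accumulates more profiles, and one of them can be named DEFAULT, which is picked up when no profile is specified. A hedged sketch, mirroring the example above with hypothetical hosts:

[DEFAULT]
host      = https://dbc-1111a111-a1b2.cloud.databricks.com/
auth_type = databricks-cli

[dbc-1234a567-b8c9]
host      = https://dbc-1234a567-b8c9.cloud.databricks.com/
auth_type = databricks-cli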

One thing to pay attention to is the evaluation order:

  1. For any command run from a bundle working directory: the values of fields within the project’s bundle settings files.
  2. The values of environment variables (DATABRICKS_HOST and DATABRICKS_TOKEN).
  3. Configuration profile field values within the .databrickscfg file.

If you happen to have such an environment variable set, it might interfere with authentication.
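
A quick way to check whether any of these variables are set on your machine is a minimal snippet like the one below (only the standard library is used; DATABRICKS_CONFIG_PROFILE also influences which profile gets picked):

import os

# These variables take precedence over .databrickscfg profiles
for var in ("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CONFIG_PROFILE"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")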

Databricks Connect

Let’s now try out Databricks Connect without using the VS Code extension.

1. Create and start a cluster. We use the smallest Personal Compute cluster with the 15.4 LTS runtime.

2. Create a demo folder, create a virtual environment, and install the databricks-connect package. We must use Python 3.11.x and databricks-connect version 15.4.x to match the runtime.

mkdir demo
cd demo
uv venv -p 3.11 .venv
source .venv/bin/activate
uv pip install databricks-connect==15.4.4
python        
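
The last command drops us into a Python console. Before running any Spark code, it can be useful to verify that the local versions really match the 15.4 LTS runtime (a minimal sketch using only the standard library):

import sys
from importlib.metadata import version

# Sanity check: these should line up with the cluster runtime (15.4 LTS)
print("Python:", sys.version.split()[0])                     # expect 3.11.x
print("databricks-connect:", version("databricks-connect"))  # expect 15.4.x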

3. Run the code. From the same Python console, let’s try to run some commands:

from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.profile("dbc-1234a567-b8c9").getOrCreate()
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)        

A couple of things to note here:

  1. You need to choose the correct profile from the .databrickscfg file. If the profile is named DEFAULT, you do not need to specify it.
  2. The DatabricksSession is used to run Spark code on the cluster. However, if you run, for example, df.toPandas() (which may be needed for an ML library you are using), the dataframe is loaded into your machine’s memory, and if the data is too big, you will get out-of-memory errors; see the sketch after this list.
  3. It may surprise you, but you can also import pyspark, even though pyspark is not listed as a dependency of databricks-connect and uv does not show that pyspark is installed.
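
For point 2, a common mitigation is to filter or aggregate on the cluster first and only collect a small result locally. A sketch reusing the df from above (column names are from samples.nyctaxi.trips):

# Reduce the data on the cluster before pulling it into local memory
small_df = (
    df.select("trip_distance", "fare_amount")
    .where("fare_amount > 0")
    .limit(10_000)
)
pandas_df = small_df.toPandas()  # only up to 10,000 rows end up in local memory
print(pandas_df.shape)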

The import in point 3 works because pyspark is “hidden” and embedded inside databricks-connect (which is also why it is a bad idea to list both databricks-connect and pyspark in your project dependencies).

You can see it by going to .venv/lib/python3.11/site-packages.
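
You can also confirm it from the same Python console (a quick check, nothing Databricks-specific):

import pyspark

# Resolves to the copy bundled inside databricks-connect in the virtual environment
print(pyspark.__file__)
print(pyspark.__version__)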

4. Let’s try to use SparkSession instead!

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)        

It should not surprise you that this does not work. The import works without issues (because pyspark is installed), but when we try to run the second line, we get a RuntimeError: “Only remote Spark sessions using Databricks Connect are supported. Use DatabricksSession.builder to create a remote Spark session instead.”

Interestingly enough, this works with the VS Code extension! Let’s get into it.

VS Code Extension

Let’s now try out the VS Code extension.

1. Install the extension. Follow the official documentation.

2. Open the demo project. It does not have any files except the .venv folder. Click on the Databricks logo -> create configuration -> choose your profile.

3. Make sure all checkmarks are green: activate your environment, install pip, and select your cluster.

4. Let’s check the files. We can see that the extension created several files, including a databricks.yml file. The host in that file must match the host specified in the profile.
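
For reference, the generated databricks.yml is a small asset bundle definition along these lines (a hedged sketch using the example host from earlier; your bundle name and targets may differ):

bundle:
  name: demo

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dbc-1234a567-b8c9.cloud.databricks.com/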

5. Let’s create a demo.py file (with the same code we tried to run from the terminal). We’ll also add “# Databricks notebook source” as the first line so that the code is recognized as a notebook.
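
A minimal demo.py could look like this (same profile and table as before; the “# COMMAND ----------” marker optionally splits the file into cells):

# Databricks notebook source
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.profile("dbc-1234a567-b8c9").getOrCreate()

# COMMAND ----------
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)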

This works because the extension created the following configuration in the .vscode folder (settings.json):

{
    "jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])",
    "jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------"
}        

The code can also run in VS Code: a Spark session can be created because the extension set all the necessary environment variables and stored them in the .databricks.env file in the .databricks folder.
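
If you are curious what gets injected, you can print the variables from a script run through the extension (a simple sketch; the exact set of variables is an implementation detail and may change):

import os

# List everything Databricks-related that the extension put into the environment
for key in sorted(os.environ):
    if key.startswith("DATABRICKS"):
        print(key, "=", os.environ[key])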

Unfortunately, the databricks-connect code is proprietary, so we cannot tell exactly how the environment variables are evaluated.

The only thing we know is that there is no magic :-)

Conclusions

We hope this article helps clarify some of the most common points of confusion when developing outside of the Databricks environment.

Check out a related article about developing on Databricks: https://medium.com/marvelous-mlops/developing-on-databricks-without-compromises-10d623d6301a
