Modern Python project management with uv and Databricks Asset Bundles
Ivan Trusov
Enabling teams to build efficient Data Intelligence Platforms with Databricks. All views and opinions are my own.
The infrastructure for Python projects has undergone several changes in the last few years. Several years ago, the requirements.txt file was one of the most common approaches, despite being clunky and leading to problems with updates. In recent years, the empty niche of convenient and simple dependency/project management has quickly been filled by various tools, for example poetry, hatch, and the recent newcomer from Astral called uv.
I had a chance to work with poetry and hatch earlier, and my developer experience was mostly pleasant, with only minor nitpicks.
Since uv is a relatively new tool with promising performance and user experience, I decided to give it a shot and build a sample Databricks project with it. In this article I'll describe the details of setting up a project, managing the local environment, and configuring dependencies to be installed on Databricks when the workflow runs.
Initializing the project
The prerequisites for the setup are as follows:
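At a minimum, this means having uv and the Databricks CLI installed locally, plus access to a Databricks workspace with Serverless compute enabled (this is my reading of the setup used throughout the article). A quick sanity check that both tools are on your PATH:
uv --version
databricks --version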
In most of my projects, I prefer an src/ based layout. It allows for a convenient separation of the code into src and test folders. Fortunately, uv supports this approach and calls it "Packaged Applications":
uv init --package --name=dabs-with-uv dabs-with-uv
This command will create a new directory with the following structure:
dabs-with-uv
├── README.md
├── pyproject.toml
└── src
└── dabs_with_uv
└── __init__.py
I also prefer to explicitly pin the Python version to match the one in the Serverless environment. You can quickly check the current version by running a notebook on Serverless:
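A single cell like this in a Serverless notebook prints the interpreter version (here it reported a 3.10.x interpreter, hence the pin below):
import sys
print(sys.version)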
Now I can set the same version in my local project. To do this, I specify it in pyproject.toml:
# find this line in pyproject.toml and set to the relevant one
requires-python = ">=3.10"
This sets up the package requirement, but doesn't change the local Python version. To pin it, use uv:
# assuming you're in the project directory
uv python pin 3.10
Creating the local venv and configuring it in VSCode
To trigger the creation of the venv, run this command in the project directory:
uv sync
In the output, you'll see the path of the newly created virtual environment. You can then set this environment as the interpreter in your IDE (in my case it's VS Code):
Now you can easily add more files and structure your project.
Managing dependencies
When managing dependencies for a Python project, it makes sense to logically separate them into groups: the main dependencies required at runtime, and development-only dependencies (such as test tooling) that stay local.
uv comes with convenient CLI tooling to add and group the dependencies:
uv add <package-name> # to add it as a main dependency
uv add --group <group-name> <package-name> # add dependency to a specific group
I'll cover the dependencies later when preparing a sample ETL task.
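As a small illustration of the grouping, the test-only packages used later in this article could be added to a dev group; the command below is a sketch, and the repository may organize the groups differently:
uv add --dev pytest python-dotenv  # dev-only: test runner and .env loading for local testing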
Versioning
Usually developers prefer to have the version dynamically derived from the VCS; in my example I'm using git. Here uv offers some compatibility with hatch, which allows using hatch-vcs to make the version dynamic:
# pyproject.toml
[project]
...
- version = ... # remove this line
+ dynamic = ["version"] # add this line

[build-system]
requires = ["hatchling", "hatch-vcs"]
build-backend = "hatchling.build" # already present after uv init --package

[tool.hatch.version]
source = "vcs"
After saving these changes to the pyproject.toml, run the sync:
uv sync
And you'll see that the version is now picked up from git:
Installed 1 package in 2ms
- dabs-with-uv==0.1.0 (from file:///~/projects/dabs-with-uv)
+ dabs-with-uv==0.1.dev0+d20241225 (from file:///~/projects/dabs-with-uv)
Now you can use git-based tags to provide a version, for instance:
git tag -a "v0.1.2" -m "Release 0.1.2" # add new tag
uv build --wheel # run the build
In the build log, you'll see that the version has been bumped:
Building wheel...
Successfully built dist/dabs_with_uv-0.1.2-py3-none-any.whl
Adding an ETL task
I'll use a very simplistic ETL task in this example since it's not the main focus. You can see the snippet below:
class Aggregator:
    # some details snipped out for cohesiveness
    def run(self):
        logger.info("Preparing aggregated orders table")
        self._prepare_catalog()
        orders_source = self.spark.table(self.config.input_source.orders.full_name)
        # limit if provided in the config
        orders = (
            orders_source
            if not self.config.limit
            else orders_source.limit(self.config.limit)
        )
        # Aggregating orders by o_custkey
        orders_aggregated = orders.groupBy("o_custkey").agg(
            {"o_totalprice": "sum", "o_orderkey": "count"}
        )
        orders_aggregated = orders_aggregated.withColumnRenamed(
            "sum(o_totalprice)", "total_price"
        )
        orders_aggregated = orders_aggregated.withColumnRenamed(
            "count(o_orderkey)", "order_count"
        )
        # Writing aggregated orders to the output sink
        orders_aggregated.write.saveAsTable(
            self.config.output_sink.orders_analytics.full_name,
            mode="overwrite",
        )
        logger.info("Aggregated orders table created")
As you can see from the code, the idea is to take data from the samples.tpch schema, aggregate it, and save the result to an output table. This might not be the most optimal ETL approach for real-life scenarios, but that discussion is out of scope for this article.
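The snippet references a Config object with nested table definitions (config.input_source, config.output_sink, and a full_name property). The real implementation lives in the repository; a minimal sketch of how such a configuration could be modeled, based only on the attributes used above, might look like this:
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Table:
    catalog: str
    db: str
    name: str

    @property
    def full_name(self) -> str:
        # fully qualified Unity Catalog name, e.g. samples.tpch.orders
        return f"{self.catalog}.{self.db}.{self.name}"


@dataclass
class InputSource:
    # defaults to the samples.tpch.orders table used in this article
    orders: Table = field(default_factory=lambda: Table("samples", "tpch", "orders"))


@dataclass
class OutputSink:
    orders_analytics: Table


@dataclass
class Config:
    output_sink: OutputSink
    input_source: InputSource = field(default_factory=InputSource)
    limit: Optional[int] = None
Keeping the fully qualified table name behind a property makes the catalog/schema/table triple easy to swap via the workflow parameters shown later.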
Testing with Databricks Connect
There are two options to test this code: with a local Spark session, or with Databricks Connect against a real workspace.
I'll focus on setting up a Databricks Connect based example. With the Serverless capabilities, testing with Databricks Connect becomes a breeze, eliminating the need to configure a local Spark environment to run tests.
To add the databricks-connect package without putting it into the main project dependencies, I use uv's dev group. A version constraint can optionally be specified while adding the package:
uv add --dev 'databricks-connect<16'
A good test usually doesn't require a developer to change anything in the main source code. To achieve this in practice, developers mock out everything that isn't necessary and keep direct property passing to a minimum, relying on approaches like dependency injection.
The usage of Databricks Connect simplifies the testing logic to a good extent. I only need to define a session-scoped fixture:
from pathlib import Path

import pytest
from dotenv import load_dotenv
from databricks.connect import DatabricksSession

from dabs_with_uv.logger import logger


@pytest.fixture(scope="session", autouse=True)
def session() -> None:
    dotenv_file = (
        Path(__file__).parent.parent.parent / ".env"
    )  # tests/connect/conftest.py -> ../../.env (project root)
    if dotenv_file.exists():
        logger.info(f"Loading environment variables from {dotenv_file}")
        load_dotenv(dotenv_file)
    else:
        logger.warning("No .env file found. Environment variables will not be loaded.")
    logger.info("Creating Databricks session")
    _ = DatabricksSession.builder.serverless().getOrCreate()
    logger.info("Databricks session created")
This fixture loads environment variables from the .env file at the project root and then spins up a Databricks Connect session for Spark. I don't need to change anything in the Aggregator code, because it picks up the SparkSession in the following way:
class Aggregator:
    def __init__(self, config: Config):
        self.config = config
        self.spark = SparkSession.builder.getOrCreate()
So it will automatically pick up the session provided by the runtime.
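For reference, the .env file consumed by the fixture typically only needs the workspace authentication details. A minimal sketch using the standard Databricks SDK environment variables (placeholders to be replaced; keep this file out of version control):
# .env
DATABRICKS_HOST=https://<your-workspace>.cloud.databricks.com
DATABRICKS_TOKEN=<personal-access-token>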
Given this fixture, the only remaining hard dependency is the workflow configuration. It can be passed as follows:
from dabs_with_uv.config import Config, OutputSink, Table
from dabs_with_uv.aggregator import Aggregator


def test_aggregator():
    config = Config(
        limit=100,
        output_sink=OutputSink(
            orders_analytics=Table(
                catalog="main", db="default", name="orders_analytics"
            )
        ),
    )
    aggregator = Aggregator(config)
    aggregator.run()
    # check that the output table was created
    assert aggregator.spark.catalog.tableExists(
        config.output_sink.orders_analytics.full_name
    )
    # check that output table is not empty
    assert (
        aggregator.spark.table(config.output_sink.orders_analytics.full_name).count()
        > 0
    )
Admittedly, the assertions above are fairly basic; more structural or business-logic checks should probably be added to improve coverage.
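With the dev group installed, the test runs through uv, which takes care of activating the project environment (assuming pytest is part of the dev dependencies as sketched earlier):
uv run pytest tests/ -v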
Packaging and deploying with DABs
This simplistic ETL task is now complete and covered with a test, which means it's time to prepare the deployment configuration.
The first step is to add an entrypoint function in Python code. This function will be triggered when the task is started in the workflow:
def entrypoint():
    config = Config.from_args()
    aggregator = Aggregator(config)
    aggregator.run()
This function needs to be referenced as an entry point (also sometimes called a script) in pyproject.toml:
[project.scripts]
aggregator = "dabs_with_uv.aggregator:entrypoint"
In more generic terms, the path should follow this format: <package>.<subpackage>:<function_name>.
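Because the script is registered in pyproject.toml, it can also be smoke-tested locally through uv before any deployment; the three positional arguments match the sys.argv parsing shown further below, and a working Databricks Connect setup is assumed since the job needs a Spark session:
uv run aggregator main default orders_analytics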
That's it on the Python side. Now it's time to author the databricks.yml file. The bundle requires a name:
bundle:
  name: dabs-with-uv
And since we're using uv, we need to provide custom artifact logic:
artifacts:
  default:
    type: whl
    path: .
    build: uv build --wheel
This means that before the actual deployment, the uv build command will be launched, and the generated wheel file will be picked up.
To flexibly configure the arguments, DABs provide a variables capability:
variables:
  catalog:
    type: string
    description: The output catalog name
    default: main
  schema:
    type: string
    description: The output schema name
    default: default
  table:
    type: string
    description: The output table name
    default: orders_analytics
Developers can pass variables during deployment:
databricks bundle deploy --var="catalog=ivt" --var="schema=default" -t dev
And to pass these variables through to the Python code, a job with a task needs to be defined:
resources:
  jobs:
    dabs-with-uv:
      name: dabs-with-uv
      tasks:
        - task_key: main
          max_retries: 0
          disable_auto_optimization: true
          python_wheel_task:
            package_name: dabs_with_uv
            entry_point: aggregator
            parameters:
              - ${var.catalog}
              - ${var.schema}
              - ${var.table}
          environment_key: Default
      environments:
        - environment_key: Default
          spec:
            client: "1"
            dependencies:
              - ./dist/*.whl
Please note several things in this configuration: the wheel dependency points at the locally built artifact in ./dist, which DABs uploads during deployment; the environment_key ties the task to a serverless environment definition, where client: "1" selects the environment version; and the workflow variables are passed to the entry point as plain positional parameters.
Since parameters need to be passed, additional parsing needs to be introduced on the Python side. Although in more complex scenarios tools like click or Typer can be used, for simplicity's sake I'm just using sys.argv:
class Config:
    # snip

    @staticmethod
    def from_args() -> "Config":
        # positional args: <output_catalog> <output_db> <output_table>
        args = sys.argv[1:]
        if len(args) != 3:
            raise ValueError(
                "Supported arguments: <output_catalog> <output_db> <output_table>"
            )
        output_catalog, output_db, output_table = args[:3]
        return Config(
            output_sink=OutputSink(
                orders_analytics=Table(
                    catalog=output_catalog, db=output_db, name=output_table
                )
            )
        )
After putting this together, a deployment is as easy as a single CLI command:
databricks bundle deploy --var="catalog=..." --var="schema=..." -t dev
The workflow will be automatically created or updated on the Databricks side. You can manage deployments into different workspaces by using the targets section in the databricks.yml file.
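As an illustration, a targets section could look roughly like this (a sketch with placeholder hosts; the mode and workspace settings depend on your environment):
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://<dev-workspace>.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://<prod-workspace>.cloud.databricks.com
Once deployed, the job can also be triggered straight from the CLI with databricks bundle run dabs-with-uv -t dev.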
Summary
The uv tool, although relatively new, provides a full set of convenient capabilities for developers to manage and set up Python projects. The dependency groups mechanism offers a way to logically separate local, development-only dependencies from the required ones.
Combined with Databricks Asset Bundles, it makes for a solid tech stack to easily manage and deploy Python projects and applications on the Databricks platform.
The full source code can be found on my GitHub, in this repository.
Finally, please hit like, share, and subscribe to me on LinkedIn. If you like what I'm doing, this is the best way to support me as an author.