Modern Python project management with uv and Databricks Asset Bundles

The infrastructure around Python projects has gone through several changes in the last few years. Not so long ago, the requirements.txt file was one of the most common approaches, despite being clunky and making dependency updates painful. The niche of convenient, simple dependency and project management has been quickly filling up with various tools, for example poetry, hatch, and the recent newcomer from Astral called uv.

I had a chance to work with poetry and hatch earlier, and the developer experience was reasonably convenient, although with minor nitpicks.

Since uv is a relatively new tool with promising performance and user experience, I decided to give it a shot and build a sample Databricks project with it. In this article I'll describe setting up the project, managing the local environment, and configuring dependencies so that they get installed on Databricks when the workflow runs.

Initializing the project

The prerequisites for the setup are as follows:

  • uv
  • Databricks CLI
  • Databricks workspace

In most of my projects, I prefer an src/-based layout, which conveniently separates the source code and test directories. Fortunately, uv supports this approach and calls it "Packaged Applications":

uv init --package --name=dabs-with-uv dabs-with-uv        

This command will create a new directory with the following structure:

dabs-with-uv
├── README.md
├── pyproject.toml
└── src
    └── dabs_with_uv
        └── __init__.py        

I also prefer to explicitly pin the Python version to match the one in the Serverless environment. You can quickly check the current version by running a notebook on Serverless:

Run these lines on a Serverless cluster to get the available version
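
For reference, a minimal cell like the following is enough to see the interpreter version (this is just an illustration of the check, not the exact code from the screenshot):

import sys

print(sys.version)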

Now I can set this version in my local project. To do this, I need to set the requirement in pyproject.toml:

# find this line in pyproject.toml and set to the relevant one
requires-python = ">=3.10"         

This sets the package requirement, but doesn't change the local Python version. To pin it, use uv:

# assuming you're in the project directory
uv python pin 3.10        

Creating the local venv and configuring it in VSCode

To trigger the creation of the venv, run this command in the project directory:

uv sync        

In the output, you'll see the path of the newly created virtual environment. You can then set this environment as the interpreter in your IDE (in my case it's VS Code):

VS Code usually notices the prepared path and recommends it

Now you can easily add more files and structure your project.
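
For reference, by the end of this article the project will look roughly like this (the module names are inferred from the snippets below, and the test file name is hypothetical):

dabs-with-uv
├── README.md
├── pyproject.toml
├── databricks.yml
├── src
│   └── dabs_with_uv
│       ├── __init__.py
│       ├── aggregator.py
│       ├── config.py
│       └── logger.py
└── tests
    └── connect
        ├── conftest.py
        └── test_aggregator.py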

Managing dependencies

When managing dependencies for a Python project, you need to logically separate them into the following groups:

  • Main dependencies of your project that are used during the runtime of your workflows
  • Testing and local dependencies
  • Development-cycle related dependencies (e.g. linters, formatters)

uv comes with convenient CLI tooling to add and group the dependencies:

uv add <package-name> # to add it as a main dependency
uv add --group <group-name> <package-name> # add dependency to a specific group        
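
For instance, a hypothetical split could look like this (the package names here are purely illustrative):

uv add pydantic                # main dependency, needed at runtime
uv add --group dev pytest      # testing/local dependency
uv add --group lint ruff       # development-cycle dependency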

I'll cover the dependencies later when preparing a sample ETL task.

Versioning

Usually developers prefer to have the version defined dynamically, based on the VCS. In my example I'm using git. Here uv offers some compatibility with hatch, which allows using hatch-vcs to make the version dynamic:

# pyproject.toml
[project]
...
- version = ... # remove this line
+ dynamic = ["version"] # add this line

[build-system]
requires = ["hatchling", "hatch-vcs"]

[tool.hatch.version]
source = "vcs"        

After saving these changes to the pyproject.toml, run the sync:

uv sync        

And you'll see that the version is now picked up from git:

Installed 1 package in 2ms
 - dabs-with-uv==0.1.0 (from file:///~/projects/dabs-with-uv)
 + dabs-with-uv==0.1.dev0+d20241225 (from file:///~/projects/dabs-with-uv)        

Now you can use git-based tags to provide a version, for instance:

git tag -a "v0.1.2" -m "Release 0.1.2"  # add new tag
uv build --wheel # run the build        

In the build log, you'll see that the version has been bumped:

Building wheel...
Successfully built dist/dabs_with_uv-0.1.2-py3-none-any.whl        
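
If you want to double-check which version actually landed in the environment, the standard importlib.metadata API can be used (assuming the package is installed into the active environment):

from importlib.metadata import version

print(version("dabs-with-uv"))  # e.g. 0.1.2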

Adding an ETL task

I'll use a very simplistic ETL task in this example since it's not the main focus. You can see the snippet below:

class Aggregator:
    # some details snipped out for cohesiveness 

    def run(self):
        logger.info("Preparing aggregated orders table")
        self._prepare_catalog()

        orders_source = self.spark.table(self.config.input_source.orders.full_name)

        # limit if provided in the config
        orders = (
            orders_source
            if not self.config.limit
            else orders_source.limit(self.config.limit)
        )

        # Aggregating orders by o_custkey
        orders_aggregated = orders.groupBy("o_custkey").agg(
            {"o_totalprice": "sum", "o_orderkey": "count"}
        )
        orders_aggregated = orders_aggregated.withColumnRenamed(
            "sum(o_totalprice)", "total_price"
        )
        orders_aggregated = orders_aggregated.withColumnRenamed(
            "count(o_orderkey)", "order_count"
        )

        # Writing aggregated orders to the output sink
        orders_aggregated.write.saveAsTable(
            self.config.output_sink.orders_analytics.full_name,
            mode="overwrite",
        )

        logger.info("Aggregated orders table created")        

As you can see from the code, the idea is to take data from the samples.tpch schema, aggregate it, and save the result to the output table. This might not be the most optimal ETL approach for real-life scenarios, but that discussion is out of scope for this article.
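
The Aggregator relies on a small configuration object that isn't shown above. Below is a minimal sketch of what those classes might look like, with fields inferred from the snippets in this article (the actual repository may differ):

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Table:
    catalog: str
    db: str
    name: str

    @property
    def full_name(self) -> str:
        # three-level Unity Catalog name, e.g. main.default.orders_analytics
        return f"{self.catalog}.{self.db}.{self.name}"


@dataclass
class InputSource:
    # the source table from the Databricks sample datasets
    orders: Table = field(
        default_factory=lambda: Table(catalog="samples", db="tpch", name="orders")
    )


@dataclass
class OutputSink:
    orders_analytics: Table


@dataclass
class Config:
    output_sink: OutputSink
    input_source: InputSource = field(default_factory=InputSource)
    limit: Optional[int] = None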

Testing with Databricks Connect

There are two options to test this code:

  1. Locally with Apache Spark and Delta Lake
  2. On Databricks via Databricks Connect

I'll focus on setting up a Databricks Connect based example. With Serverless capabilities, testing with Databricks Connect has become a breeze, eliminating the need to configure a local Spark environment to run tests.

To add the databricks-connect package without making it a main project dependency, uv's dev group is used. You can optionally specify version constraints while adding packages:

uv add --dev 'databricks-connect<16'        
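
The test fixture shown below also relies on pytest and python-dotenv, so if they aren't in the project yet, they can be added to the same dev group:

uv add --dev pytest python-dotenv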

A good test usually doesn't require a developer to change anything in the main source code. To achieve this in practice, developers mock out everything that's not necessary and use as little direct property passing as possible, concentrating on approaches like dependency injection.

The usage of Databricks Connect simplifies the testing logic to a good extent. I only need to define a session-scoped fixture:

from pathlib import Path
import pytest
from dabs_with_uv.logger import logger
from dotenv import load_dotenv
from databricks.connect import DatabricksSession


@pytest.fixture(scope="session", autouse=True)
def session() -> None:
    dotenv_file = (
        Path(__file__).parent.parent.parent / ".env"
    )  # tests/connect/conftest.py -> ../../.env (project root)
    if dotenv_file.exists():
        logger.info(f"Loading environment variables from {dotenv_file}")
        load_dotenv(dotenv_file)
    else:
        logger.warning("No .env file found. Environment variables will not be loaded.")

    logger.info("Creating Databricks session")
    _ = (
        DatabricksSession.builder.serverless()
        .getOrCreate()
    )
    logger.info("Databricks session created")        
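
For reference, the .env file in the project root would typically hold the standard Databricks authentication variables (the values below are placeholders; any other authentication method supported by the Databricks SDK, such as a configuration profile, works as well):

DATABRICKS_HOST=https://<your-workspace-url>
DATABRICKS_TOKEN=<personal-access-token>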

This fixture loads environment variables from the .env file in the project root and then spins up a Databricks Connect session for Spark. I don't need to change anything in the Aggregator code, because it picks up the SparkSession in the following way:

class Aggregator:

    def __init__(self, config: Config):
        self.config = config
        self.spark = SparkSession.builder.getOrCreate()
        

So it will automatically pick up the session from the runtime.

Given this fixture, the only actual hard dependency is the configuration of the workflow. This one can be passed as follows:

from dabs_with_uv.config import Config, OutputSink, Table
from dabs_with_uv.aggregator import Aggregator


def test_aggregator():
    config = Config(
        limit=100,
        output_sink=OutputSink(
            orders_analytics=Table(
                catalog="main", db="default", name="orders_analytics"
            )
        ),
    )
    aggregator = Aggregator(config)
    aggregator.run()

    # check that the output table was created
    assert aggregator.spark.catalog.tableExists(
        config.output_sink.orders_analytics.full_name
    )
    # check that output table is not empty
    assert (
        aggregator.spark.table(config.output_sink.orders_analytics.full_name).count()
        > 0
    )        

Admittedly, the assertions above are a bit basic; more structural or business-logic checks should be introduced to improve coverage.
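
With the fixture and the test in place, the suite can be run through uv (assuming pytest sits in the dev group as shown earlier):

uv run pytest tests/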

Packaging and deploying with DABs

This simple ETL task is now complete and covered with tests, which means it's time to prepare the deployment configuration.

The first step is to add an entrypoint function in Python code. This function will be triggered when the task is started in the workflow:

def entrypoint():
    config = Config.from_args()
    aggregator = Aggregator(config)
    aggregator.run()        

This function needs to be referenced as an entrypoint (also sometimes called script) in pyproject.toml:

[project.scripts]
aggregator = "dabs_with_uv.aggregator:entrypoint"        

In more generic terms, the path should follow this format: <package>.<module>:<function_name>.
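
Since the script is registered in pyproject.toml, it can also be smoke-tested locally through uv (the arguments below are placeholders, and the run will try to obtain a Spark session, so Databricks Connect needs to be configured):

uv run aggregator <output_catalog> <output_db> <output_table>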

That's it on the Python side - now it's time to author the databricks.yml file. The bundle requires a name:

bundle:
  name: dabs-with-uv        

And since we're using uv, we need to provide custom artifact logic:

artifacts:
  default:
    type: whl
    path: .
    build: uv build --wheel        

This means that before the actual deployment, the uv build command will be launched, and the generated wheel file will be picked up.

To flexibly configure the arguments, DABs provide a variables capability:

variables:
  catalog:
    type: string
    description: The output catalog name
    default: main
  schema:
    type: string
    description: The output schema name
    default: default
  table:
    type: string
    description: The output table name
    default: orders_analytics        

Developers can pass variables during deployment:

databricks bundle deploy --var="catalog=ivt" --var="schema=default" -t dev        

And to pass the variables through to the Python entrypoint, a task needs to be defined:

resources:
  jobs:
    dabs-with-uv:
      name: dabs-with-uv

      tasks:
        - task_key: main
          max_retries: 0
          disable_auto_optimization: true
          python_wheel_task:
            package_name: dabs_with_uv
            entry_point: aggregator
            parameters:
              - ${var.catalog}
              - ${var.schema}
              - ${var.table}

          environment_key: Default

      environments:
        - environment_key: Default
          spec:
            client: "1"
            dependencies:
              - ./dist/*.whl        

Please note several things in this configuration:

  1. We add parameters via ${var.*} substitution.
  2. The package_name needs to point to the actual Python package name. Note: if the name contains dashes (-), they will be automatically converted to underscores (_) on the Databricks side, so make sure the package name is specified correctly in DABs.
  3. The entry_point selects which entrypoint from [project.scripts] shall be used. Since we've configured the entrypoint called aggregator in the section above, the same name is used here.
  4. The dependencies in the environments section contain a pointer to ./dist/*.whl - this is exactly the local directory where the uv build will output the wheel file. It will be automatically picked, uploaded, and referenced in the workflow settings.


Since parameters need to be passed, additional parsing needs to be introduced on the Python side. Although in more complex scenarios tools like click or Typer can be used, for simplicity's sake I'm just using sys.argv:

class Config:
    # snip

    @staticmethod
    def from_args() -> Config:
        args = sys.argv[1:]

        if len(args) != 3:
            raise ValueError(
                "Supported arguments: <output_catalog> <output_db> <output_table>"
            )

        output_catalog, output_db, output_table = args[:3]

        return Config(
            output_sink=OutputSink(
                orders_analytics=Table(
                    catalog=output_catalog, db=output_db, name=output_table
                )
            )
        )        

After putting this together, a deployment is as easy as a single CLI command:

databricks bundle deploy --var="catalog=..." --var="schema=..." -t dev        

The workflow will be automatically created or updated on the Databricks side. You can manage deployments into different workspaces by using the targets section in the databricks.yml file.
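
A minimal sketch of such a targets section might look like this (the host value is a placeholder):

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://<your-workspace-url>

Once deployed, the job can also be triggered from the CLI, for example with databricks bundle run dabs-with-uv -t dev.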

Summary

The uv tool, although relatively new, provides a full set of convenient capabilities for developers to manage and set up Python projects. The groups mechanism offers a way to logically separate dependencies into "local" ones and those required at runtime.

Combined with Databricks Asset Bundles, it makes for a good tech stack to easily manage and deploy Python projects and applications on the Databricks platform.

The full source code can be found on my GitHub, in this repository.

Finally, please hit like, share, and subscribe to me on LinkedIn. If you like what I'm doing - this is the best way to support me as an author.

Douglas Pires

Senior Analytics Engineer — I collaborate on crafting solutions that involve people, data, and tools to solve problems

Outstanding article Ivan Trusov. One question: in your case you are running the unit tests via Databricks clusters. Is there a way to run them locally or on a CI agent, so that the unit tests aren't tied to Databricks? I've been following this issue on the Databricks community ( https://community.databricks.com/t5/data-engineering/installing-databricks-connect-breaks-pyspark-local-cluster-mode/m-p/102536#M41152 ). Thanks for the useful tips (especially the dynamic versioning).

Jean B.

Data Architect @ Clear Channel Europe | Data Architecture, Data Engineering, Cloud Solutions

I'm currently using DABs with Hatch but I'm looking forward to exploring uv after all the reviews I've read so far. One little detail that I hope uv will cover is dynamic versioning, which prevents Databricks from picking up an older wheel and ensures that the latest wheel with the highest version is always picked up.

Laurent Prat

Senior Solution Architect AI/ML at Databricks EMEA

Useful tips
