Slim CI with dbt Core: Efficient Pipelines Using Azure DevOps



GitHub Repository


About the GitHub Repository

The sample code mentioned in this article is available in the Git repository linked above. This repository is a fork of Jaffle Shop, an open-source sandbox dbt project that is useful for demonstrating and experimenting with dbt workflows. If you want to test the project in your Azure DevOps environment, you can fork the repository and add it, along with the DevOps pipelines, to your Azure DevOps project.

The examples in this article are based on the Jaffle Shop data models for convenience. I have used Databricks as the data platform in my examples, but the code can easily be adapted to any dbt-supported data platform, such as Microsoft Fabric, Snowflake, or BigQuery.


Introduction

In today's data operations, the ability to quickly identify and transform business requirements into business value is essential in an ever-changing competitive landscape.

This mindset has been prevalent for years within the field of software development, where we have seen the emergence of methodologies like DevOps and best practices such as CI/CD.

At the core of these concepts is the fundamental motivation to reduce the lead-time between identifying a requirement and delivering value. In the data space, we have also seen various tools and platforms begin to prioritize this approach, enabling data teams to apply the principles of DevOps (or DataOps) and CI/CD.

One of these tools is dbt by dbt Labs, Inc. In this article, I will explore how to harness the flexibility provided by the open-source version of dbt, dbt Core, to create an efficient, fully self-contained CI pipeline with Azure DevOps.

An in-depth explanation of dbt Core and its capabilities for your data project is beyond the scope of this piece. However, I have briefly discussed dbt Core as a tool in a previous article.


CI/CD Pipeline

For data teams using dbt Core, the CI/CD pipeline is typically divided into several stages. The figure below illustrates one such example of a CI/CD pipeline workflow:


Example of a CI/CD Pipeline with dbt-core


  • Committing Code: Team members typically work on requirements in Git feature branches, isolating their code changes from unrelated changes made by other team members.
  • Creating a Pull Request: When a developer believes that a certain requirement is fulfilled, a pull request is created from the feature branch to kick off the CI/CD process.
  • Static Code Analysis: Code analysis tools scan the pull request for violations of commonly accepted best practices, typically using "opinionated" linters such as SQLFluff, SQLLint, or Pylint (see the example command after this list).
  • Slim CI Tests: The focus of this article is on building and testing only the specific data models affected by the pull request in a CI environment. This ensures that no adverse effects have been introduced.
  • Automated Tests: Running other kinds of automated tests that validate the data project. This could be integration tests or unit tests on certain code units.
  • Peer Review: A manual inspection of the proposed change by one or more team members.

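To make the static code analysis step concrete, a dbt project could, for example, be linted with SQLFluff roughly like this (a sketch; the dialect and path are assumptions based on the Databricks setup used in this article):

# Install SQLFluff with the dbt templater and lint the models folder
pip install sqlfluff sqlfluff-templater-dbt
sqlfluff lint models/ --dialect databricks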


Slim CI Explained

A slim CI pipeline narrows the scope of verifying a proposed code change so that only the code artifacts affected by the branch or pull request being tested are inspected.

In a dbt context, Slim CI means running and testing only the dbt models directly impacted by the changes made by the developer. This includes:

  • The specific model(s) where changes have occurred.
  • The immediate downstream model(s) that reference the changed model(s).

All things being equal, there is little reason to run any additional models in the project, as there is a very high chance that any defects introduced by the proposed change will be detected by running the narrowly scoped selection. Models further upstream or downstream should be unaffected, since neither they nor their immediate parent/child models have changed.

This approach is particularly useful for medium or large-sized dbt projects with a certain degree of complexity. It is also beneficial if you have data models that require significant time or resources to build, as it allows you to limit the amount of data warehouse compute used in your CI pipelines by skipping models unaffected by the code changes.

dbt Cloud has built-in support for slim CI, so you can quickly get up and running in this scenario. With dbt Core however, there is a bit of work to do before you are able to enjoy the fruits of a streamlined CI pipeline!



Slim CI With the dbt CLI

The dbt CLI lends itself very well to incorporation into CI pipelines. In particular, two CLI features are critical to supporting our pipeline:

  • state:modified: Selects models by comparing the current project manifest with a manifest from a different environment (e.g., production), so that only models that have changed are included. The path to the existing manifest is supplied via the --state argument.
  • defer: Defers references from the selected models to the models/schema defined in the manifest supplied via --state, so that unselected upstream dependencies do not have to be rebuilt.

To start, let's first try to build the Jaffle Shop project to get a baseline for the next steps. If we try to run dbt build from the command line on the Jaffle Shop project, we see the following output:

dbt build        

In total, 46 operations have been executed, split between models, unit tests, data tests and saved queries. We can see that the models in this example are built into the schema ci_main.

Now that all the models have been built in the data warehouse, let's save the previously generated dbt manifest into a folder called ci_main.


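From the command line, this could be as simple as copying the manifest out of dbt's target folder (the folder name just has to match the path later passed to --state):

mkdir -p ci_main
cp target/manifest.json ci_main/manifest.json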

This JSON-based manifest contains all the dbt artifacts with references to the ci_main schema we just used. If we look at the dbt lineage graph via dbt docs, we can see the following DAG:

Jaffle Shop lineage graph with the stg_supplies model highlighted


If we then zoom in on stg_supplies, we can see that there are only two direct downstream dependencies, namely supplies and order_items:

stg_supplies With Downstream Dependencies


Let us now try to run dbt build again with the following changes:

  • Building the models into a schema called ci
  • Pointing to the ci_main manifest and instructing dbt to only build changed models + the immediate next level of downstream dependencies
  • Deferring other dependencies to the models residing in the ci_main schema


dbt build --select "state:modified+1" --state "ci_main" --defer        

The state:modified selector compares our current manifest with the ci_main manifest and correctly concludes that no changes exist.

The next step is to make a dummy change to stg_supplies by adding a new column to the model, and then re-running the previous command.


dbt build --select "state:modified+1" --state "ci_main" --defer        

This time, dbt correctly detected that stg_supplies was changed and selected it along with its first-level (+1) downstream dependencies, supplies and order_items. The additional dependencies needed to build order_items were referenced via the ci_main schema and were not processed in this run.

This can dramatically reduce the time it takes to validate a change for larger data projects.


Azure DevOps Pipeline Implementation

Now that the underlying mechanisms in dbt for supporting slim CI have been established, it's time to operationalize them into a proper CI workflow using Azure DevOps pipelines.

Here’s how a slim CI setup could be implemented from an Azure DevOps perspective:


Note: This article assumes a strict separation between development, CI (or QA), and production environments. In practice, this means that data models do not cross environment boundaries, and each environment builds dbt models specifically for itself.

Of course, you can adapt these environments to suit your specific workflow.


Let's examine the workflows involved:

Developer workflow: A developer creates a feature branch and starts working on the implementation, committing code changes along the way (1). During development, dbt models are executed using the local development environment (command line tool), and the models are materialized in the development data warehouse (DWH).

CI Workflow: When the developer creates a pull request, a CI pipeline is triggered (2) to run the affected dbt models and any associated tests. If errors are discovered, the pipeline fails, and the developer must make changes to address the issues. The pipeline runs the dbt models against a CI environment and defers any unaffected models to the models residing in the ci_main schema. For this to work, a manifest containing the models present in the ci_main schema is downloaded from a storage account.

A second pipeline (3) is configured to react to any changes merged into the main branch. This pipeline builds the entire dbt project, including all models, into the ci_main schema in the CI environment. The dbt manifest generated from the main branch is then uploaded to a storage account. This pipeline is best suited to a schedule-based trigger, running late at night or early in the morning and capturing all the changes merged into the main branch for that workday. If no changes are merged on a given workday, the pipeline isn't run, conserving compute resources on the data platform.

CD Workflow: When the proposed changes have passed all CI quality gates, including a peer review, the pull request can be merged into the main (production) branch. This triggers a deployment pipeline (4) that pushes the changes into production (not covered by this article).

The CI workflow is supported by two pipelines:

  1. Main Branch Pipeline: azure-pipelines-dbt-ci-main.yml
  2. Pull Request Pipeline: azure-pipelines-dbt-slim-ci.yml

The basic file structure looks like this:
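Based on the files referenced in this article, the repository layout looks roughly like the sketch below (illustrative; the actual repository may contain additional files):

jaffle-shop/
├── models/                            # Jaffle Shop dbt models
├── dbt_project.yml
├── profiles.yml
├── template-steps-init-dbt.yml
├── azure-pipelines-dbt-ci-main.yml
└── azure-pipelines-dbt-slim-ci.yml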


profiles.yml

dbt requires a profile to connect to your data platform. In this example, we'll use Databricks. Instead of hard-coding the connection values and committing them to git, you can use Jinja templating to reference environment variables. This approach allows you to securely manage sensitive information and easily switch between environments.

profiles.yml
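A minimal sketch of such a profile for the dbt-databricks adapter is shown below; the profile name, target names, and environment variable names are assumptions and should be adjusted to match your own project:

jaffle_shop:
  target: ci
  outputs:
    ci:
      type: databricks
      catalog: "{{ env_var('DBT_CATALOG') }}"
      schema: ci
      host: "{{ env_var('DBT_DATABRICKS_HOST') }}"
      http_path: "{{ env_var('DBT_DATABRICKS_HTTP_PATH') }}"
      token: "{{ env_var('DBT_DATABRICKS_TOKEN') }}"
      threads: 4
    ci_main:
      type: databricks
      catalog: "{{ env_var('DBT_CATALOG') }}"
      schema: ci_main
      host: "{{ env_var('DBT_DATABRICKS_HOST') }}"
      http_path: "{{ env_var('DBT_DATABRICKS_HTTP_PATH') }}"
      token: "{{ env_var('DBT_DATABRICKS_TOKEN') }}"
      threads: 4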


template-steps-init-dbt.yml

This pipeline template includes the necessary initialization steps to run dbt. It covers installing the correct Python version, the Python package handler pip, and the dbt-databricks adapter (which includes dbt Core).

It is referenced by both azure-pipelines-dbt-ci-main.yml and azure-pipelines-dbt-slim-ci.yml (see below).

template-steps-init-dbt.yml
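A minimal sketch of such a steps template could look like this; the Python version is an assumption:

# template-steps-init-dbt.yml (illustrative sketch)
steps:
  - task: UsePythonVersion@0
    displayName: 'Use Python 3.11'
    inputs:
      versionSpec: '3.11'

  - script: |
      python -m pip install --upgrade pip
      pip install dbt-databricks
    displayName: 'Install dbt-databricks (includes dbt Core)'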


azure-pipelines-dbt-ci-main.yml

This pipeline is scheduled to run on the main branch and executes dbt commands to build all models from the production branch into a "ci_main" schema within the data warehouse. It is triggered at 4:30 AM UTC from Monday to Friday, ensuring that the latest changes are incorporated into the ci_main schema at the start of the workday. The pipeline runs only if there have been changes to the main branch since the last run.

Once the dbt command has completed, the manifest file is uploaded to a storage account for later use by the CI pipeline.

azure-pipelines-dbt-ci-main.yml
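A sketch of what this pipeline definition might look like is shown below; the variable group, secret names, service connection, and storage account details are assumptions and must be replaced with your own values:

# azure-pipelines-dbt-ci-main.yml (illustrative sketch)
trigger: none

schedules:
  - cron: "30 4 * * 1-5"               # 4:30 AM UTC, Monday to Friday
    displayName: Daily ci_main build
    branches:
      include:
        - main
    always: false                      # skip the run if main has not changed

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: dbt-ci                      # assumed variable group with Databricks credentials

steps:
  - template: template-steps-init-dbt.yml

  - script: dbt build --target ci_main
    displayName: 'Build all models into the ci_main schema'
    env:
      DBT_DATABRICKS_TOKEN: $(databricks-token)    # assumed secret variable name

  - task: AzureCLI@2
    displayName: 'Upload manifest to the storage account'
    inputs:
      azureSubscription: 'my-service-connection'   # assumed service connection name
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        az storage blob upload \
          --account-name mystorageaccount \
          --container-name dbt-artifacts \
          --name manifest.json \
          --file target/manifest.json \
          --overwrite \
          --auth-mode login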


azure-pipelines-dbt-slim-ci.yml

This pipeline runs on every pull request and uses the "ci_main" schema to defer unaltered models in the PR branch during evaluation. This method makes the CI process more efficient by rebuilding and testing only the models affected by the PR. As a result, the overall execution time is significantly reduced, especially for medium or large projects.

The pipeline performs the following steps:

  • Initializes Python and dbt
  • Downloads the manifest file generated by azure-pipelines-dbt-ci-main.yml, which references models built in the ci_main schema
  • Runs dbt build, deferring unaltered models using the aforementioned manifest

azure-pipelines-dbt-slim-ci.yml
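A sketch of the pull request pipeline could look like this; again, the variable group, service connection, and storage account details are assumptions:

# azure-pipelines-dbt-slim-ci.yml (illustrative sketch)
trigger: none                          # started by the build validation branch policy, not by pushes

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: dbt-ci                      # assumed variable group with Databricks credentials

steps:
  - template: template-steps-init-dbt.yml

  - task: AzureCLI@2
    displayName: 'Download the ci_main manifest from the storage account'
    inputs:
      azureSubscription: 'my-service-connection'   # assumed service connection name
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        mkdir -p ci_main
        az storage blob download \
          --account-name mystorageaccount \
          --container-name dbt-artifacts \
          --name manifest.json \
          --file ci_main/manifest.json \
          --auth-mode login

  - script: dbt build --select "state:modified+1" --state "ci_main" --defer --target ci
    displayName: 'Build and test changed models, deferring the rest to ci_main'
    env:
      DBT_DATABRICKS_TOKEN: $(databricks-token)    # assumed secret variable name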

Configuring Your Azure DevOps Project

The last step is to configure the pipelines in your DevOps project.


Prerequisites

  1. Azure DevOps Account: Ensure you have an Azure DevOps account and the necessary permissions to create and manage branch policies.
  2. Project and Repository: You should have a project and a repository set up in Azure DevOps.
  3. Storage Account: Provision an Azure Storage Account to store the dbt manifest.
  4. Service Connection: Set up a service connection to the Azure Resource Manager (Project -> Settings -> Service Connections) and assign the Storage Blob Data Contributor role to the service principal on your storage account.
  5. Pipelines: Add two build pipelines pointing to the YAML files azure-pipelines-dbt-ci-main.yml and azure-pipelines-dbt-slim-ci.yml.


Setting up the CI Pipeline

To trigger the CI pipeline when a pull request is created or updated, follow these steps:

  1. Navigate to Your Project: Open your Azure DevOps organization and navigate to the project where your repository is located.
  2. Go to Repos: In the left-hand menu, click on Repos to view your repositories.
  3. Select the Repository: Choose the repository for which you want to set up the branch policy.
  4. Branch Policies: Click on Branches in the left-hand menu. Find the branch you want to set the policy on (e.g., main) and click on the ellipsis (...) next to it. Select Branch policies from the dropdown menu.
  5. Add Build Validation: In the Branch policies page, scroll down to the Build validation section. Click on + Add build policy.
  6. Configure Build Policy: Select the pipeline pointing to the azure-pipelines-dbt-slim-ci.yml pipeline definition. Choose whether the build is required or optional. Typically, you would set this to required to ensure that the build must pass before the pull request can be completed. Select the trigger conditions. Usually, you would select "Automatically build the pull request" to ensure the build runs whenever a pull request is created or updated. Optionally, specify path filters to control which files or folders trigger the build validation. This is useful for filtering out changes that don't require dbt validation.
  7. Save the Policy: After configuring the build policy, click Save to apply the policy to the branch.


Setting up the CI Main Pipeline

The CI Main Pipeline ensures that the dbt models from the main branch are built into the ci_main schema and are ready for use by the CI pipeline. Once you add the pipeline definition file to a DevOps pipeline, there isn't much else to do since the triggering behavior is already defined within the pipeline.

Adjusting the Schedule

Instead of running the pipeline daily, you can adjust the schedule to better fit your needs. For example, you could configure it to run only once a week to manage costs associated with the data warehouse.
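For instance, a weekly schedule in the ci-main pipeline could look like this (a sketch, running Mondays at 4:30 AM UTC):

schedules:
  - cron: "30 4 * * 1"                 # 4:30 AM UTC, Mondays only
    displayName: Weekly ci_main build
    branches:
      include:
        - main
    always: false                      # skip the run if main has not changed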

Considerations for Less Frequent Runs

If you increase the interval between runs, be aware that the CI pipeline will need to verify an increasing number of models over time before a new baseline is created.


Using Azure DevOps Secrets

To securely manage sensitive information like the Databricks token, you can use Azure DevOps secret variables:

  1. Go to your Azure DevOps project.
  2. Navigate to Pipelines > Library.
  3. Create a new variable group or edit an existing one.
  4. Add the necessary variables and mark the sensitive ones as secrets.

In your pipeline YAML, reference the secret variables:

variables:
  - group: your-variable-group        

You can also store secrets in a key vault and access these from the pipeline directly.
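As an alternative, secrets can be pulled from an Azure Key Vault at runtime, for example with the AzureKeyVault task; the service connection, vault, and secret names below are assumptions:

steps:
  - task: AzureKeyVault@2
    displayName: 'Fetch secrets from Key Vault'
    inputs:
      azureSubscription: 'my-service-connection'   # assumed service connection name
      KeyVaultName: 'my-key-vault'                 # assumed key vault name
      SecretsFilter: 'databricks-token'            # assumed secret name
      RunAsPreJob: false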



Putting the CI Workflow to the Test

Running the ci-main pipeline on the main branch produces the following output. Similar to running it locally, we can see that the whole Jaffle Shop project has been built by the pipeline.


If the CI pull request pipeline is triggered and no changes have been made to any models, the below output is produced:


If we repeat the experiment and add a new column to stg_supplies, we again see that the pull request CI pipeline only builds and tests the affected model and its immediate downstream dependencies.

If any problems are detected, the build will fail and the build validation rule will prevent the pull request from being merged.


Closing Thoughts

Thank you for taking the time to read this!

I hope this article has inspired you to set up a CI workflow using dbt Core and Azure DevOps. This setup can help you efficiently scale your data project in both size and complexity without compromising on quality.

Remember, automated quality controls are essential but should not stand alone. Be sure to incorporate other techniques, such as peer reviews, into your development workflow.

There are many ways to achieve this, so feel free to explore and find what works best for you.



dbt Core™ and the dbt logo are trademarks of dbt Labs, Inc.
