Slim CI with dbt Core: Efficient Pipelines Using Azure DevOps
About the GitHub Repository
The sample code mentioned in this article is available in the Git repository linked above. This repository is a fork of Jaffle Shop, an open-source sandbox dbt project that is useful for demonstrating and experimenting with dbt workflows. If you want to test the project in your Azure DevOps environment, you can fork the repository and add it, along with the DevOps pipelines, to your Azure DevOps project.
The examples in this article are based on the Jaffle Shop data models for convenience. I have used Databricks as the data platform in my examples, but the code can easily be adapted to any dbt-supported data platform, such as Microsoft Fabric, Snowflake, or BigQuery.
Introduction
In today's data operations, the ability to quickly identify and transform business requirements into business value is essential in an ever-changing competitive landscape.
This mindset has been prevalent for years within the field of software development, where we have seen the emergence of methodologies like DevOps and best practices such as CI/CD.
At the core of these concepts is the fundamental motivation to reduce the lead-time between identifying a requirement and delivering value. In the data space, we have also seen various tools and platforms begin to prioritize this approach, enabling data teams to apply the principles of DevOps (or DataOps) and CI/CD.
One of these tools is dbt by dbt Labs, Inc. In this article, I will explore how to harness the flexibility provided by the open-source version of dbt, dbt Core, to create an efficient, fully self-contained CI pipeline with Azure DevOps.
An in-depth explanation of dbt Core and its capabilities for your data project is beyond the scope of this piece. However, I have briefly discussed dbt Core as a tool in a previous article.
CI/CD Pipeline
For data teams using dbt Core, the CI/CD pipeline is typically divided into several stages. The figure below illustrates one example of such a CI/CD pipeline workflow:
Slim CI Explained
A slim CI pipeline narrows the scope of verifying a proposed code change so that only the code artifacts actually affected by the branch or pull request under test are inspected.
In a dbt context, Slim CI means running and testing only the dbt models directly impacted by the changes made by the developer. This includes the modified models themselves, their first-degree downstream dependencies, and any tests defined on them.
All else being equal, there is little reason to run any additional models in the project: there is a very high chance that any defects introduced by the proposed change will be caught by running this narrowly scoped selection. Models further upstream or downstream should be unaffected, since neither they nor their immediate parent or child models have changed.
This approach is particularly useful for medium or large-sized dbt projects with a certain degree of complexity. It is also beneficial if you have data models that require significant time or resources to build, as it allows you to limit the amount of data warehouse compute used in your CI pipelines by skipping models unaffected by the code changes.
dbt Cloud has built-in support for slim CI, so you can quickly get up and running in this scenario. With dbt Core, however, there is a bit of work to do before you can enjoy the fruits of a streamlined CI pipeline!
Slim CI With the dbt CLI
The dbt CLI lends itself very well to incorporation into CI pipelines. In particular, two CLI arguments are critical to supporting our pipeline:
- --select with the state:modified method, which selects only the nodes that differ from a previously saved manifest (the comparison manifest is supplied via --state), and
- --defer, which resolves references to unselected, unchanged models against the environment described by that manifest instead of rebuilding them.
To start, let's first try to build the Jaffle Shop project to get a baseline for the next steps. If we try to run dbt build from the command line on the Jaffle Shop project, we see the following output:
dbt build
In total, 46 operations have been executed, split between models, unit tests, data tests and saved queries. We can see that the models in this example are built into the schema ci_main.
Now that all the models have been built in the data warehouse, let's save the previously generated dbt manifest (target/manifest.json) into a folder called ci_main.
This JSON-based manifest contains all the dbt artifacts, with references to the ci_main schema we just used. If we look at the dbt lineage graph via dbt docs, we can see the following DAG:
If we then zoom in on stg_supplies, we can see that there are only two direct downstream dependencies, namely supplies and order_items:
Let us now try to run dbt build again with the following changes:
dbt build --select "state:modified+1" --state "ci_main" --defer
The state:modified selector compares our current project against the manifest in the ci_main folder and correctly concludes that no changes exist.
The next step is to make a dummy change to stg_supplies by adding a new column to the model, and then re-running the previous command.
dbt build --select "state:modified+1" --state "ci_main" --defer
This time, dbt correctly detects that stg_supplies has changed and selects it along with its first-degree (+1) downstream dependencies, supplies and order_items. The additional dependencies needed to build order_items are resolved against the ci_main schema and are not processed in this run.
This can dramatically reduce the time it takes to validate a change for larger data projects.
Azure DevOps Pipeline Implementation
Now that the underlying mechanisms in dbt for supporting slim CI have been established, it's time to operationalize them into a proper CI workflow using Azure DevOps pipelines.
Here’s how a slim CI setup could be implemented from an Azure DevOps perspective:
Note: This article assumes a strict separation between development, CI (or QA), and production environments. In practice, this means that data models do not cross environment boundaries, and each environment builds dbt models specifically for itself.
Of course, you can adapt these environments to suit your specific workflow.
Let's examine the workflows involved:
Developer workflow: A developer creates a feature branch and starts working on the implementation, committing code changes along the way (1). During development, dbt models are executed using the local development environment (command line tool), and the models are materialized in the development data warehouse (DWH).
CI Workflow: When the developer creates a pull request, a CI pipeline is triggered (2) to run the affected dbt models and any associated tests. If errors are discovered, the pipeline fails, and the developer must make changes to address the issues. The pipeline runs the dbt models against a CI environment and defers any unaffected models to the models residing in the ci_main schema. For this to work, a manifest containing the models present in the ci_main schema is downloaded from a storage account.
A second pipeline (3) is configured to react to any changes merged into the main branch. This pipeline builds the entire dbt project, including all models, into the ci_main schema in the CI environment. The dbt manifest generated from the main branch is then uploaded to a Storage Account. This pipeline is best set to a schedule-based trigger, running late at night or early in the morning based on all the changes merged into the main branch for that workday. If no changes are merged for a given workday, the pipeline isn't run, conserving compute resources on the data platform.
CD Workflow: When the proposed changes have passed all CI quality gates, including a peer review, the pull request can be merged into the main (production) branch. This triggers a deployment pipeline (4) that pushes the changes into production (not covered by this article).
The CI workflow is supported by two pipelines:
- azure-pipelines-dbt-ci-main.yml, which builds the main branch into the ci_main schema and publishes the resulting manifest, and
- azure-pipelines-dbt-slim-ci.yml, which runs on every pull request and builds only the affected models.
The basic file structure looks like this:
dbt requires a profile to connect to your data platform. In this example, we'll use Databricks. Instead of hard-coding the connection values and committing them to git, you can use Jinja templating to reference environment variables. This approach allows you to securely manage sensitive information and easily switch between environments.
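A minimal sketch of what such a profile could look like, assuming the profile is named jaffle_shop, the target is called ci, and the environment variable names shown here are illustrative choices set by the pipeline:

jaffle_shop:
  target: ci
  outputs:
    ci:
      type: databricks
      catalog: "{{ env_var('DBT_DATABRICKS_CATALOG', 'main') }}"
      schema: "{{ env_var('DBT_SCHEMA', 'ci_main') }}"          # defaults to ci_main unless overridden
      host: "{{ env_var('DBT_DATABRICKS_HOST') }}"
      http_path: "{{ env_var('DBT_DATABRICKS_HTTP_PATH') }}"
      token: "{{ env_var('DBT_DATABRICKS_TOKEN') }}"
      threads: 4

Because the values are resolved at runtime via env_var, the same profile file can serve the CI pipelines and local development without committing any secrets to git.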
This pipeline template includes the necessary initialization steps to run dbt. It covers installing the correct Python version, upgrading the Python package manager pip, and installing the dbt-databricks adapter (which includes dbt Core).
It is referenced by both azure-pipelines-dbt-ci-main.yml and azure-pipelines-dbt-slim-ci.yml (see below).
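A minimal sketch of such a template, assuming a template file name of dbt-install-template.yml and Python 3.11 (both are illustrative choices, not taken from the repository):

# dbt-install-template.yml (name assumed)
steps:
  - task: UsePythonVersion@0
    displayName: Use Python 3.11
    inputs:
      versionSpec: '3.11'

  - script: |
      python -m pip install --upgrade pip
      pip install dbt-databricks
    displayName: Install dbt-databricks (includes dbt Core)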
This pipeline is scheduled to run on the main branch and executes dbt commands to build all models from the production branch into a "ci_main" schema within the data warehouse. It is triggered at 4:30 AM UTC from Monday to Friday, ensuring that the latest changes are incorporated into the ci_main schema at the start of the workday. The pipeline runs only if there have been changes to the main branch since the last run.
Once the dbt command has completed, the manifest file is uploaded to a storage account for later use by the CI pipeline.
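A sketch of how this pipeline definition could look; the storage account, container, variable group, and service connection names are placeholders, and the upload is shown with the Azure CLI as one possible approach:

# azure-pipelines-dbt-ci-main.yml (sketch)
trigger: none

schedules:
  - cron: '30 4 * * 1-5'          # 04:30 UTC, Monday to Friday
    displayName: Nightly ci_main build
    branches:
      include:
        - main
    always: false                  # skip the run if main has not changed

pool:
  vmImage: ubuntu-latest

variables:
  - group: your-variable-group

steps:
  - template: dbt-install-template.yml        # shared init steps (name assumed)

  - script: dbt build --target ci              # profile resolves the schema to ci_main in this setup
    displayName: Build all models into ci_main
    env:
      DBT_DATABRICKS_TOKEN: $(databricks-token)   # secret mapped explicitly

  - task: AzureCLI@2
    displayName: Upload manifest to storage
    inputs:
      azureSubscription: 'your-service-connection'
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        az storage blob upload --auth-mode login \
          --account-name yourstorageaccount \
          --container-name dbt-manifests \
          --name manifest.json \
          --file target/manifest.json \
          --overwrite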
This pipeline runs on every pull request and defers unaltered models in the PR branch to the "ci_main" schema during evaluation. This makes the CI process more efficient by rebuilding and testing only the models affected by the PR. As a result, the overall execution time is significantly reduced, especially for medium or large projects.
The pipeline performs the following steps:
1. Checks out the pull request branch.
2. Installs Python and dbt via the shared pipeline template.
3. Downloads the ci_main manifest from the storage account into a ci_main folder.
4. Runs dbt build with --select "state:modified+1" --state "ci_main" --defer, so only the modified models and their direct downstream dependencies are built and tested, while everything else is deferred to the ci_main schema.
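As a sketch of the core of this pipeline (the download is shown with the Azure CLI; account, container, variable group, and service connection names are placeholders, and the per-PR schema override is purely illustrative):

# azure-pipelines-dbt-slim-ci.yml (sketch)
trigger: none        # triggered via a build validation branch policy on main

pool:
  vmImage: ubuntu-latest

variables:
  - group: your-variable-group

steps:
  - template: dbt-install-template.yml        # shared init steps (name assumed)

  - task: AzureCLI@2
    displayName: Download ci_main manifest
    inputs:
      azureSubscription: 'your-service-connection'
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        mkdir -p ci_main
        az storage blob download --auth-mode login \
          --account-name yourstorageaccount \
          --container-name dbt-manifests \
          --name manifest.json \
          --file ci_main/manifest.json

  - script: dbt build --select "state:modified+1" --state "ci_main" --defer --target ci
    displayName: Build and test modified models
    env:
      DBT_DATABRICKS_TOKEN: $(databricks-token)                   # secret mapped explicitly
      DBT_SCHEMA: ci_pr_$(System.PullRequest.PullRequestId)       # illustrative: build PR models into their own schema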
Configuring Your Azure DevOps Project
The last step is to configure the pipelines in your DevOps project.
Prerequisites
Before setting up the pipelines, you will need:
- An Azure DevOps project containing the (forked) repository.
- A dbt-supported data platform; the examples use a Databricks workspace and an access token.
- An Azure Storage Account (with a container) for storing the dbt manifest.
- A variable group or key vault holding the connection secrets used by the dbt profile.
Setting up the CI Pipeline
To trigger the CI pipeline when a pull request is created or updated, follow these steps:
1. In Azure DevOps, create a new pipeline that points to azure-pipelines-dbt-slim-ci.yml in your repository.
2. Under Repos > Branches, open the branch policies for the main branch and add a Build Validation policy that references this pipeline.
3. Set the policy as required, so the pull request cannot be completed until the pipeline succeeds.
Setting up the CI Main Pipeline
The CI Main Pipeline ensures that the dbt models from the main branch are built into the ci_main schema and are ready for use by the CI pipeline. Once you add the pipeline definition file to a DevOps pipeline, there isn't much else to do since the triggering behavior is already defined within the pipeline.
Adjusting the Schedule
Instead of running the pipeline daily, you can adjust the schedule to better fit your needs. For example, you could configure it to run only once a week to manage costs associated with the data warehouse.
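With the Azure DevOps cron syntax, a weekly Monday-morning run could look like this (a sketch; only the cron expression differs from the nightly schedule shown earlier):

schedules:
  - cron: '30 4 * * 1'     # 04:30 UTC, Mondays only
    displayName: Weekly ci_main build
    branches:
      include:
        - main
    always: false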
Considerations for Less Frequent Runs
If you increase the interval between runs, be aware that the CI pipeline will have to rebuild and verify an increasing number of models, since more changes will have accumulated since the last baseline manifest was created.
Using Azure DevOps Secrets
To securely manage sensitive information like the Databricks token, you can use Azure DevOps secret variables:
1. Under Pipelines > Library, create a variable group (for example, your-variable-group).
2. Add a variable for the token (for example, databricks-token) and mark it as secret using the lock icon.
3. Link the variable group to your pipelines as shown below.
In your pipeline YAML, reference the secret variables:
variables:
- group: your-variable-group
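Note that Azure DevOps does not expose secret variables to script steps automatically; they must be mapped into the environment explicitly. A minimal sketch, assuming the variable names used above:

steps:
  - script: dbt build --target ci
    env:
      DBT_DATABRICKS_TOKEN: $(databricks-token)   # secret from the variable group, mapped explicitly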
You can also store secrets in a key vault and access them from the pipeline directly.
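One way to do this is the AzureKeyVault task, which makes the selected secrets available as pipeline variables; the service connection and vault names below are placeholders:

steps:
  - task: AzureKeyVault@2
    displayName: Fetch secrets from Key Vault
    inputs:
      azureSubscription: 'your-service-connection'
      KeyVaultName: 'your-key-vault'
      SecretsFilter: 'databricks-token'
      RunAsPreJob: false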
Putting the CI Workflow to the Test
Running the ci-main pipeline on the main branch produces the following output. Similar to running it locally, we can see that the whole Jaffle Shop project has been built by the pipeline.
If the CI pull request pipeline is triggered and no changes have been made to any models, the below output is produced:
If we repeat the experiment and add a new column to stg_supplies, we again see that the pull request CI pipeline only builds and tests the affected model and its immediate downstream dependencies.
If any problems are detected, the build fails and the build validation rule prevents the pull request from being merged.
Closing Thoughts
Thank you for taking the time to read this!
I hope this article has inspired you to set up a CI workflow using dbt Core and Azure DevOps. This setup can help you efficiently scale your data project in both size and complexity without compromising on quality.
Remember, automated quality controls are essential but should not stand alone. Be sure to incorporate other techniques, such as peer reviews, into your development workflow.
There are many ways to achieve this, so feel free to explore and find what works best for you.
dbt Core™ and the dbt logo are trademarks of dbt Labs, Inc.