Day 18: Using dbt with GitLab CI/CD Pipeline

Surya Ambati

Lead Analyst at CRISIL Global Research & Analytics

Published: October 2, 2024

In today’s article, we’ll delve into integrating dbt with GitLab's CI/CD pipeline, a crucial step for automating dbt workflows and ensuring that your data models go through a robust development lifecycle. We’ll cover setting up a directory structure for dbt and GitLab CI, defining environment variables for different stages (test, build, deploy), and walking through an example of a CI/CD configuration file. Let's get started!

Directory Structure

A well-organized directory structure ensures that your dbt project and GitLab CI/CD pipelines work seamlessly together. Here’s an example structure that keeps everything tidy:

#bash

├── .gitlab-ci.yml             # CI/CD configuration file
├── dbt_project.yml            # dbt project configuration
├── profiles.yml               # dbt connection profiles (excluded from version control)
├── models                     # Folder for dbt models
│   ├── staging                # Staging models
│   └── marts                  # Business logic models
├── seeds                      # Static data
├── snapshots                  # For snapshots
├── tests                      # For dbt tests
└── ci-scripts                 # Custom scripts for CI/CD if needed

dbt_project.yml: This is your dbt project’s configuration file. It contains project-level settings like where your models are stored and how they’re structured.
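profiles.yml: This file holds the connection details for each environment (dev, staging, prod). Keep it out of version control whenever it contains plaintext credentials. The CI job then has to get the file some other way, for example from a GitLab CI/CD variable of type File, or from a committed copy that reads its secrets from environment variables.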
.gitlab-ci.yml: This file defines the pipeline stages and jobs for GitLab CI/CD.
ci-scripts/: This folder contains any custom scripts that may be required in the CI/CD process.
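
As a rough sketch of what dbt_project.yml might contain for the layout above. The project name my_project and the materialization choices are assumptions made for illustration; the article's own project file is not shown.

```yaml
# dbt_project.yml (illustrative sketch; names and materializations are assumptions)
name: my_project
version: "1.0.0"
config-version: 2

# Must match the top-level key in profiles.yml
profile: my_project

# Where dbt looks for each resource type (these mirror the folders in the tree)
model-paths: ["models"]
seed-paths: ["seeds"]
snapshot-paths: ["snapshots"]
test-paths: ["tests"]

models:
  my_project:
    staging:
      +materialized: view   # lightweight staging layer
    marts:
      +materialized: table  # business-facing models
```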

Setting Up GitLab CI/CD Pipeline

1. Defining Environment Variables for Different Stages

Environment variables are critical for securely passing credentials, dbt profiles, and configurations into the CI/CD pipeline. GitLab CI allows defining variables at the project or group level, which are then available during pipeline execution.

Go to Settings > CI/CD in GitLab.
Under the Variables section, add environment-specific variables such as database credentials, schema names, or dbt profile paths. Mark secrets such as passwords as masked so that they do not appear in job logs.
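
One possible split between file-level and UI-level variables is sketched below. Every variable name except DBT_PROFILES_DIR is hypothetical and chosen only for illustration; the article does not list the variables it actually uses.

```yaml
# Fragment of .gitlab-ci.yml (sketch). Variable names are hypothetical.
# Non-sensitive defaults can be declared in the file itself:
variables:
  DBT_PROFILES_DIR: "$CI_PROJECT_DIR/profiles"
  DBT_DBNAME: "my_database"

# Secrets are better added in Settings > CI/CD > Variables (masked), for example:
#   DBT_HOST      database host for the environment
#   DBT_USER      database user
#   DBT_PASSWORD  database password
```

Variables set in the GitLab UI are exposed to jobs as ordinary environment variables, so scripts and dbt can read them without further wiring.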

2. The .gitlab-ci.yml File

The .gitlab-ci.yml file orchestrates the CI/CD pipeline by defining stages (e.g., test, build, deploy) and the jobs within each stage.

Here’s a basic example for dbt:

#yaml file:

stages:
  - test
  - build
  - deploy

# Test Stage: Run dbt tests on the dev environment
test:
  stage: test
  image: fivetran/dbt:latest
  script:
    - export DBT_ENV="dev"
    - dbt deps  # Install dependencies
    - dbt seed --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
    - dbt run --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
    - dbt test --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
  only:
    - merge_requests  # Run tests only on MRs
  variables:
    DBT_PROFILES_DIR: $CI_PROJECT_DIR/profiles  # Path to the dbt profiles directory
    DBT_TARGET: "dev"  # Run against the dev environment

# Build Stage: Build models in the staging environment
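build:
  stage: build
  image: fivetran/dbt:latest
  script:
    - export DBT_ENV="staging"
    - dbt deps  # Install dependencies (packages from the test job are not carried over)
    - dbt run --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
  only:
    - develop  # Run on the develop branch
  variables:
    DBT_PROFILES_DIR: $CI_PROJECT_DIR/profiles  # Job-level variables are not shared, so set it here too
    DBT_TARGET: "staging"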

# Deploy Stage: Deploy models in the production environment
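deploy:
  stage: deploy
  image: fivetran/dbt:latest
  script:
    - export DBT_ENV="prod"
    - dbt deps  # Install dependencies (packages from earlier jobs are not carried over)
    - dbt run --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
  only:
    - main  # Run only on the main branch
  variables:
    DBT_PROFILES_DIR: $CI_PROJECT_DIR/profiles  # Job-level variables are not shared, so set it here too
    DBT_TARGET: "prod"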

3. Explanation of the .gitlab-ci.yml Components

Stages: We’ve defined three stages: test, build, and deploy. Each stage corresponds to a step in the CI/CD pipeline.
Image: The pipeline uses the fivetran/dbt Docker image, which comes pre-installed with dbt.
Script: Each job runs one or more dbt commands:
    dbt deps: Install the dependencies specified in packages.yml.
    dbt seed: Load seed data (useful for static datasets).
    dbt run: Build and materialize models.
    dbt test: Run data tests to ensure model quality.
Only: The only keyword restricts when each job runs: tests run on merge requests, builds on the develop branch, and deployments on the main branch.
Variables: We use environment variables for flexibility across environments (e.g., DBT_PROFILES_DIR, DBT_TARGET).
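
The same components can also be arranged with less repetition. The sketch below is one possible variant, not the article's configuration: it moves the shared settings into a hidden job (the name .dbt_base is hypothetical) that each job pulls in with extends, and it uses rules, which GitLab recommends over only for new pipelines.

```yaml
# Sketch only: shared settings in a hidden template, `rules` in place of `only`.
stages:
  - test
  - build
  - deploy

.dbt_base:                      # hypothetical name; the leading dot hides it from the pipeline
  image: fivetran/dbt:latest
  variables:
    DBT_PROFILES_DIR: $CI_PROJECT_DIR/profiles
  before_script:
    - dbt deps                  # runs before every job's script

test:
  extends: .dbt_base
  stage: test
  variables:
    DBT_TARGET: "dev"
  script:
    - dbt seed --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
    - dbt run --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
    - dbt test --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

build:
  extends: .dbt_base
  stage: build
  variables:
    DBT_TARGET: "staging"
  script:
    - dbt run --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
  rules:
    - if: $CI_COMMIT_BRANCH == "develop"

deploy:
  extends: .dbt_base
  stage: deploy
  variables:
    DBT_TARGET: "prod"
  script:
    - dbt run --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
```

With extends, the variables blocks are merged, so each job sees both DBT_PROFILES_DIR from the template and its own DBT_TARGET.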

4. Setting Up Different Environments

In dbt, environments are defined by the profiles.yml file. This file typically contains configurations for dev, staging, and production environments. Here’s an example:

#yaml:

my_project:
  outputs:
    dev:
      type: postgres
      host: dev-database-host
      port: 5432  # Required by the postgres adapter
      user: db_user
      password: db_password  # Placeholder; do not commit real passwords
      dbname: my_database
      schema: dev_schema
    staging:
      type: postgres
      host: staging-database-host
      port: 5432
      user: db_user
      password: db_password
      dbname: my_database
      schema: staging_schema
    prod:
      type: postgres
      host: prod-database-host
      port: 5432
      user: db_user
      password: db_password
      dbname: my_database
      schema: prod_schema
  target: dev  # Default target

In GitLab CI, each job selects its environment by passing the DBT_TARGET variable to dbt's --target flag (dev, staging, or prod). This overrides the default target set in profiles.yml.
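
To avoid hard-coding credentials in the profile, dbt's env_var function can read them from the environment at run time. The sketch below shows a single target; the variable names DBT_HOST, DBT_USER, and DBT_PASSWORD are hypothetical and would need to exist as GitLab CI/CD variables.

```yaml
# profiles.yml (sketch): only dev is shown; staging and prod would follow the same pattern.
# DBT_HOST, DBT_USER and DBT_PASSWORD are hypothetical variable names.
my_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: "{{ env_var('DBT_HOST') }}"
      port: 5432
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: my_database
      schema: dev_schema
```

A profile written this way contains no secrets, so it could be committed alongside the project. dbt raises an error if a referenced variable is missing and no default is given.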

Example Workflow

Testing: The pipeline triggers the test stage on every merge request, where dbt tests are run against the development environment.
Building: Once the code is merged into the develop branch, the build stage runs, deploying the models to the staging environment.
Deploying: When the code is ready and merged into the main branch, the deploy stage is triggered, running the dbt models in the production environment.

Best Practices

Use Environment-Specific Credentials: Store sensitive credentials like database passwords as masked GitLab CI/CD variables, with separate values for each environment, rather than in the repository.
Run Tests Early: Running dbt tests in the CI pipeline helps catch issues early before they reach production.
Monitor Pipelines: Keep an eye on your GitLab pipelines for failures and address them immediately to maintain data integrity.
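
One way to make failed pipelines easier to investigate is to keep dbt's output as job artifacts. This is a sketch under the assumption that dbt writes to its default target/ and logs/ directories; the one-week retention period is an arbitrary choice.

```yaml
# Sketch: keep dbt's run results and logs, even when the job fails.
test:
  stage: test
  image: fivetran/dbt:latest
  script:
    - dbt deps
    - dbt run --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
    - dbt test --profiles-dir $DBT_PROFILES_DIR --target $DBT_TARGET
  artifacts:
    when: always        # upload on success and on failure
    paths:
      - target/         # compiled SQL, manifest.json, run_results.json
      - logs/           # dbt.log
    expire_in: 1 week
```

The uploaded files can then be downloaded from the job page in GitLab to see which model or test failed and the SQL that was compiled for it.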

Conclusion

Setting up dbt with GitLab CI/CD enables automated, consistent, and reliable deployment of your data models. By structuring your directory properly, defining environment variables, and setting up stages in .gitlab-ci.yml, you can manage the lifecycle of dbt models effectively across dev, staging, and production environments.

With this pipeline in place, your team can confidently iterate on data models while maintaining high standards for testing and deployment.

#dbt #gitlab #cicd #DataEngineering
