DBT- Data Build Tool (Part II)

References

Alvaro Navas Notes

Data Engineering Zoomcamp Repository

Testing and documenting dbt models

Although testing and documenting are not required steps to successfully run models, they are good practices and are expected in any professional environment.

Testing

Tests in dbt are assumptions that we make about our data.

In dbt, tests are essentially SELECT statements that return the records that do not follow the assumption defined by the test; a test passes when its query returns no records.

Tests are defined on a column in the model YAML files (like the schema.yml file we defined before). dbt provides a few predefined tests to check column values, but custom tests can also be created as plain SQL queries (an example sketch appears later in this section). Here's an example test:

models:
    - name: stg_green_tripdata
      description: >
        Trip made by green taxis, also known as boro taxis and street-hail liveries.
        Green taxis may respond to street hails, but only in the areas indicated in green on the
        map (i.e. above W 110 St/E 96th St in Manhattan and in the boroughs).
        The records were collected and provided to the NYC Taxi and Limousine Commission (TLC) by
        technology service providers. 
      columns:
          - name: tripid
            description: Primary key for this table, generated with a concatenation of vendorid+pickup_datetime
            tests:
                - unique:
                    severity: warn
                - not_null:
                    severity: warn
  • The tests are defined for a column in a specific table for a specific model.
  • There are 2 tests in this YAML file: unique and not_null. Both are predefined by dbt.
  • unique checks whether all the values in the tripid column are unique.
  • not_null checks whether all the values in the tripid column are not null.
  • Because the severity is set to warn, both tests will show a warning in the command line interface instead of failing the run when they detect records that break the assumption.

Here's what the not_null test compiles to in SQL form:

select *
from "my_project"."dbt_dev"."stg_green_tripdata"
where tripid is null

You may run tests with the dbt test command.
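
Besides the predefined tests, custom (singular) tests are plain SQL files saved in the tests/ folder of the project; dbt marks the test as failed if the query returns any rows. Here's a minimal sketch (the file name and the assumption that the staging model keeps a fare_amount column are just for illustration):

-- tests/assert_no_negative_fares.sql (hypothetical file name)
-- dbt flags this test as failed if any trip has a negative fare amount
select *
from {{ ref('stg_green_tripdata') }}
where fare_amount < 0

Running dbt test executes both the generic tests defined in the YAML files and any singular tests found in the tests/ folder.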

Documentation

dbt also provides a way to generate documentation for your dbt project and render it as a website.

You may have noticed in the previous code block that a description: field can be added to the YAML files. dbt will make use of these fields when generating the documentation.

The dbt generated docs will include the following:

  • Information about the project:
      • Model code (both from the .sql files and compiled code)
      • Model dependencies
      • Sources
      • Auto generated DAGs from the ref() and source() macros
      • Descriptions from the .yml files and tests
  • Information about the Data Warehouse (information_schema):
      • Column names and data types
      • Table stats like size and rows

dbt docs can be generated in dbt Cloud or locally with dbt docs generate. The resulting website can be hosted in dbt Cloud, served locally with dbt docs serve, or deployed to any other web server.

Deployment of a dbt project

Deployment basics

The goal of dbt is to introduce good software engineering practices by defining a deployment workflow.

[Image: dbt workflow diagram: Develop, Test and Document, Deploy]

So far we've seen the Develop and "Test and Document" stages of the workflow. We will now cover deployment.

Deployment is the process of running the models we created in our development environment in a production environment. Separating the development and production environments allows us to continue building and testing models without affecting the models in production.

Normally, a production environment will have a different schema in our Data Warehouse and ideally a different user.
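
As a concrete sketch of what that separation looks like, the same model reference compiles against a different dataset depending on the target environment (the dbt_dev and production dataset names below just follow the examples used in this article):

-- compiled in the development environment
select * from "my_project"."dbt_dev"."stg_green_tripdata"

-- compiled in the production environment
select * from "my_project"."production"."stg_green_tripdata"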

The deployment workflow defines the steps used to create a model from scratch and bring it to production. Here's a deployment workflow example:

  1. Develop in a user branch.
  2. Open a PR to merge into the main branch.
  3. Merge the user branch to the main branch.
  4. Run the new models in the production environment using the main branch.
  5. Schedule the models.

dbt projects are usually deployed in the form of jobs:

  • A job is a collection of commands such as build or test. A job may contain one or more commands.
  • Jobs can be triggered manually or on schedule.
  • dbt Cloud has a scheduler which can run jobs for us, but other tools such as Airflow or cron can be used as well.
  • Each job will keep a log of the runs over time, and each run will keep the logs for each command.
  • A job may also be used to generate documentation, which may be viewed under the run information.
  • If the dbt source freshness command was run, the results can also be viewed at the end of a job.

Continuous Integration

Another good software engineering practice that dbt enables is Continuous Integration (CI): the practice of regularly merging development branches into a central repository, after which automated builds and tests are run. The goal of CI is to reduce adding bugs to the production code and maintain a more stable project.

CI is built on jobs: a CI job will do things such as build, test, etc. We can define CI jobs which can then be triggered under certain circumstances to enable CI.

dbt makes use of GitHub/GitLab's Pull Requests to enable CI via webhooks. When a PR is ready to be merged, a webhook is received in dbt Cloud that will enqueue a new run of a CI job. This run will usually be against a temporary schema that has been created explicitly for the PR. If the job finishes successfully, the PR can be merged into the main branch, but if it fails the merge will not happen.

CI jobs can also be scheduled with the dbt Cloud scheduler, Airflow, cron and a number of additional tools.

Deployment using dbt Cloud

In dbt Cloud, you might have noticed that after the first commit, the main branch becomes read-only and forces us to create a new branch if we want to keep developing. dbt Cloud does this to force us to open PRs for CI purposes rather than allowing merges straight into main.

In order to properly establish a deployment workflow, we must define environments within dbt Cloud. In the sidebar, under Environments, you will see that a default Development environment is already generated, which is the one we've been using so far.

We will create a new Production environment of type Deployment using the latest stable dbt version. By default, the environment will use the main branch of the repo but you may change it for more complex workflows. If you used the JSON credentials when setting up dbt Cloud then most of the deployment credentials should already be set up except for the dataset. For this example, we will use the production dataset (make sure that the production dataset/schema exists in your BigQuery project).

The dbt Cloud scheduler is available in the Jobs menu in the sidebar. We will create a new job named dbt build that uses the Production environment and check the Generate docs? checkbox. Add the following commands:

  1. dbt seed
  2. dbt run
  3. dbt test

In the Schedule tab at the bottom we will check the Run on schedule? checkbox with a timing of Every day and every 6 hours. Save the job. You will be shown the job's run history screen, which contains a Run now button that allows us to trigger the job manually; do so to check that the job runs successfully.

You can access the run and check the current state of it as well as the logs. After the run is finished, you will see a View Documentation button at the top; clicking on it will open a new browser window/tab with the generated docs.

Under Account settings > Projects, you may edit the project to modify the Documentation field under Artifacts; the drop-down menu there should contain the job we created, which generates the docs. After saving the changes and reloading the dbt Cloud website, you should now have a Documentation section in the sidebar.
