DBT- Data Build Tool (Part II)
Filipe Balseiro
Data Engineer | Snowflake SnowPro Core & dbt Developer Certified | Python | GCP BigQuery | CI/CD GitHub Actions. Let's elevate your data strategy!
References
Data Engineering Zoomcamp Repository
Testing and documenting dbt models
Although testing and documenting are not required steps to successfully run models, they are good practices and are expected in any professional environment.
Testing
Tests in dbt are assumptions that we make about our data.
In dbt, a test is essentially a SELECT statement that returns the number of records that fail because they do not follow the assumption defined by the test.
Tests are defined on a column in the model YAML files (like the schema.yml file we defined before). dbt provides a few predefined tests to check column values but custom tests can also be created as queries. Here's an example test:
models:
  - name: stg_green_tripdata
    description: >
      Trip made by green taxis, also known as boro taxis and street-hail liveries.
      Green taxis may respond to street hails, but only in the areas indicated in green
      on the map (i.e. above W 110 St/E 96th St in Manhattan and in the boroughs).
      The records were collected and provided to the NYC Taxi and Limousine Commission (TLC)
      by technology service providers.
    columns:
      - name: tripid
        description: Primary key for this table, generated with a concatenation of vendorid+pickup_datetime
        tests:
          - unique:
              severity: warn
          - not_null:
              severity: warn
- The tests are defined for a column in a specific table for a specific model.
- There are 2 tests in this YAML file: unique and not_null. Both are predefined by dbt.
- unique checks whether all the values in the tripid column are unique.
- not_null checks whether all the values in the tripid column are not null.
- Both tests are configured with severity: warn, so they will only raise a warning in the command line interface (rather than failing the run) if they detect offending records.
Here's roughly what the not_null test will compile to in SQL query form:
select *
from "my_project"."dbt_dev"."stg_green_tripdata"
where tripid is null
You may run tests with the dbt test command.
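To illustrate the custom tests mentioned above, here is a minimal sketch of a singular test: a SQL file placed in the project's tests/ folder that dbt runs as part of dbt test. The file name, and the assumption that stg_green_tripdata exposes a trip_distance column, are illustrative only. The query returns the records that violate the assumption, so the test passes when it returns zero rows:

-- tests/assert_positive_trip_distance.sql
-- Singular test: returns green taxi trips with a non-positive trip distance.
select
    tripid,
    trip_distance
from {{ ref('stg_green_tripdata') }}
where trip_distance <= 0

You can run only the tests that touch this model with dbt test --select stg_green_tripdata.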
Documentation
dbt also provides a way to generate documentation for your dbt project and render it as a website.
You may have noticed in the previous code block that a description: field can be added to the YAML file. dbt will make use of these fields when building the documentation.
The dbt generated docs will include the following:
- Information about the project:
  - Model code (both from the .sql files and compiled code)
  - Model dependencies
  - Sources
  - Auto-generated DAGs from the ref() and source() macros
  - Descriptions from the .yml files and tests
- Information about the Data Warehouse (information_schema):
  - Column names and data types
  - Table stats like size and rows
dbt docs can be generated in dbt Cloud or locally with dbt docs generate, and they can be hosted in dbt Cloud or served from any other web server with dbt docs serve.
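For a local dbt Core project, generating and browsing the docs is a two-step process (the defaults below may vary with your dbt version):

dbt docs generate   # compiles the project and writes the docs artifacts (e.g. catalog.json)
dbt docs serve      # serves the generated docs on a local web server (port 8080 by default)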
Deployment of a dbt project
Deployment basics
The goal of dbt is to introduce good software engineering practices by defining a deployment workflow.
So far we've seen the Develop and "Test and Document" stages of the workflow. We will now cover deployment.
Deployment is the process of running the models we created in our development environment in a production environment. Separating the development and production environments allows us to continue building and testing models without affecting the models in production.
Normally, a production environment will have a different schema in our Data Warehouse and ideally a different user.
The deployment workflow defines the steps used to create a model from scratch and bring it to production. Here's a deployment workflow example:
- Develop in a user branch.
- Open a PR to merge into the main branch.
- Merge the user branch to the main branch.
- Run the new models in the production environment using the main branch.
- Schedule the models.
dbt projects are usually deployed in the form of jobs:
- A job is a collection of commands such as build or test. A job may contain one or more commands.
- Jobs can be triggered manually or on schedule.
- dbt Cloud has a scheduler which can run jobs for us, but other tools such as Airflow or cron can be used as well.
- Each job will keep a log of the runs over time, and each run will keep the logs for each command.
- A job may also be used to generate documentation, which may be viewed under the run information.
- If the dbt source freshness command was run, the results can also be viewed at the end of a job.
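For context, the dbt source freshness command relies on a freshness block in the sources YAML. The following is a minimal sketch; the source name, project, dataset, table, and loaded-at column are hypothetical and the thresholds are illustrative only:

sources:
  - name: staging
    database: my_project          # BigQuery project holding the raw data (hypothetical)
    schema: trips_data_all        # dataset with the raw tables (hypothetical)
    tables:
      - name: green_tripdata
        loaded_at_field: record_loaded_at   # timestamp column used to measure freshness
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 24, period: hour}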
Continuous Integration
Another good software engineering practice that dbt enables is Continuous Integration (CI): the practice of regularly merging development branches into a central repository, after which automated builds and tests are run. The goal of CI is to reduce the number of bugs introduced into production code and to keep the project stable.
CI is built on jobs: a CI job will run commands such as build or test. We can define CI jobs which can then be triggered under certain circumstances to enable CI.
dbt makes use of GitHub/GitLab's Pull Requests to enable CI via webhooks. When a PR is ready to be merged, a webhook is received in dbt Cloud that will enqueue a new run of a CI job. This run will usually be against a temporary schema that has been created explicitly for the PR. If the job finishes successfully, the PR can be merged into the main branch, but if it fails the merge will not happen.
CI jobs can also be scheduled with the dbt Cloud scheduler, Airflow, cron and a number of additional tools.
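As a hedged sketch, a common pattern for such a CI job is to build only the models that were modified in the PR plus everything downstream of them, using dbt's state comparison; whether this applies to your setup depends on how the job is configured:

# Build only models changed in the PR and their downstream dependencies.
# state:modified+ requires artifacts from a previous production run to compare against;
# dbt Cloud CI jobs handle this comparison automatically.
dbt build --select state:modified+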
Deployment using dbt Cloud
In dbt Cloud, you might have noticed that after the first commit, the main branch becomes read-only and forces us to create a new branch if we want to keep developing. dbt Cloud does this to force us to open PRs for CI purposes rather than allowing merges to main straight away.
In order to properly establish a deployment workflow, we must define environments within dbt Cloud. In the sidebar, under Environments, you will see that a default Development environment is already generated, which is the one we've been using so far.
We will create a new Production environment of type Deployment using the latest stable dbt version. By default, the environment will use the main branch of the repo but you may change it for more complex workflows. If you used the JSON credentials when setting up dbt Cloud then most of the deployment credentials should already be set up except for the dataset. For this example, we will use the production dataset (make sure that the production dataset/schema exists in your BigQuery project).
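dbt Cloud manages these environment credentials through its UI, but for reference, the equivalent separation in dbt Core would live in profiles.yml with one target per environment. This is only a sketch; the profile name, GCP project, datasets, and keyfile path are hypothetical:

my_project:
  target: dev                        # default target used during development
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: my-gcp-project        # hypothetical GCP project id
      dataset: dbt_dev               # development dataset/schema
      keyfile: /path/to/keyfile.json
      threads: 4
    prod:
      type: bigquery
      method: service-account
      project: my-gcp-project
      dataset: production            # production dataset/schema used by deployment runs
      keyfile: /path/to/keyfile.json
      threads: 4

A production run would then select that target explicitly, e.g. dbt run --target prod.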
The dbt Cloud scheduler is available in the Jobs menu in the sidebar. We will create a new job named dbt build using the Production environment and check the Generate docs? checkbox. Add the following commands:
- dbt seed
- dbt run
- dbt test
In the Schedule tab at the bottom, we will check the Run on schedule? checkbox and set the timing to Every day, every 6 hours. Save the job. You will be shown the job's run history screen, which contains a Run now button that allows us to trigger the job manually; do so to check that the job runs successfully.
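For reference, dbt Cloud's schedule settings also accept a custom cron expression; the every-6-hours cadence above corresponds to:

0 */6 * * *    # at minute 0 of hours 0, 6, 12 and 18, every day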
You can access the run to check its current state as well as the logs. After the run is finished, you will see a View Documentation button at the top; clicking on it will open a new browser window/tab with the generated docs.
Under Account settings > Projects, you may edit the project to modify the Documentation field under Artifacts; you should see a drop-down menu containing the job we created that generates the docs. After saving the changes and reloading the dbt Cloud website, you should now have a Documentation section in the sidebar.