DBT- Data Build Tool (Part I)
Filipe Balseiro
?? Data Engineer | ?? Snowflake SnowPro Core & dbt Developer Certified | Python | GCP BigQuery | CI/CD Github Actions. Let's elevate your data strategy!
References
What is dbt?
dbt stands for data build tool. It's a transformation tool that allows us to transform process raw data in our Data Warehouse to transformed data which can be later used by Business Intelligence tools and any other data consumers.
dbt also allows us to introduce good software engineering practices by defining a deployment workflow:
How does dbt work?
dbt works by defining a modeling layer that stands on top of our Data Warehouse. Each table is turned into a model and then transformed into a derived model, that can be stored into the Data Warehouse for persistence.
A model consists in:
How to use dbt?
dbt has 2 main components: dbt Core and dbt Cloud with the following characteristics:
dbt Core: open-source project that allows the data transformation
dbt Cloud: SaaS application to develop and manage dbt projects
How to use dbt?
For this project, I'm integrating with GCP BigQuery so I'll use dbt Cloud IDE. No local installation of dbt is required.
Setting up dbt
You will need to create a dbt cloud using this link and connect to your Data Warehouse following these instructions. More detailed instructions available in this guide.
Developing with dbt
Anatomy of a dbt model
dbt models are a combination of SQL (using SELECT statements) and Jinja templating language to define templates.
Below is an example ofabt model:
{{
config(materialized='table')
}}
SELECT *
FROM staging.source_table
WHERE record_state = 'ACTIVE'
The FROM clause of a dbt model
The?FROM?clause within a?SELECT?statement defines the?sources?of the data to be used.
The following sources are available to dbt models:
Sources: The data loaded within our Data Warehouse.
Seeds: CSV files which can be stored in our repo under the?seeds?folder.
Here's an example of how you would declare a source in a?.yml?file:
sources:
- name: staging
database: production
schema: trips_data_all
loaded_at_field: record_loaded_at
tables:
- name: green_tripdata
- name: yellow_tripdata
freshness:
error_after: {count: 6, period: hour}
And here's how you would reference a source in a?FROM?clause:
FROM {{ source('staging','yellow_tripdata') }}
In the case of seeds, assuming you've got a?taxi_zone_lookup.csv?file in your?seeds?folder which contains?locationid,?borough,?zone?and?service_zone:
SELECT
locationid,
borough,
zone,
replace(service_zone, 'Boro', 'Green') as service_zone
FROM {{ ref('taxi_zone_lookup) }}
The?ref()?function references underlying tables and views in the Data Warehouse. When compiled, it will automatically build the dependencies and resolve the correct schema fo us. So, if BigQuery contains a schema/dataset called?dbt_dev?inside the?my_project?database which we're using for development and it contains a table called?stg_green_tripdata, then the following code...
WITH green_data AS (
SELECT *,
'Green' AS service_type
FROM {{ ref('stg_green_tripdata') }}
),
...will compile to this:
WITH green_data AS (
SELECT *,
'Green' AS service_type
FROM "my_project"."dbt_dev"."stg_green_tripdata"
),
Defining a source and creating a model
It's time to create our first model.
We will begin by creating 2 new folders under our?models?folder:
Under?staging?we will add 2 new files:?sgt_green_tripdata.sql?and?schema.yml:
# schema.yml
version: 2
sources:
- name: staging
database: your_project
schema: trips_data_all
tables:
- name: green_tripdata
- name: yellow_tripdata
-- sgt_green_tripdata.sql
{{ config(materialized='view') }}
select * from {{ source('staging', 'green_tripdata') }}
limit 100
The advantage of having the properties in a separate file is that we can easily modify the?schema.yml?file to change the database details and write to different databases without having to modify our?sgt_green_tripdata.sql?file.
You may know run the model with the?dbt run?command, either locally or from dbt Cloud.
Macros
Macros?are pieces of code in Jinja that can be reused, similar to functions in other languages.
dbt already includes a series of macros like?config(),?source()?and?ref(), but custom macros can also be defined.
Macros allow us to add features to SQL that aren't otherwise available, such as:
Macros are defined in separate?.sql?files which are typically stored in a?macros?directory.
There are 3 kinds of Jinja?delimiters:
Here's a macro definition example:
{# This macro returns the description of the payment_type #}
{% macro get_payment_type_description(payment_type) %}
case {{ payment_type }}
when 1 then 'Credit card'
when 2 then 'Cash'
when 3 then 'No charge'
when 4 then 'Dispute'
when 5 then 'Unknown'
when 6 then 'Voided trip'
end
{% endmacro %}
Here's how we use the macro:
select
{{ get_payment_type_description('payment-type') }} as payment_type_description,
congestion_surcharge::double precision
from {{ source('staging','green_tripdata') }}
where vendorid is not null
And this is what it would compile to:
select
case payment_type
when 1 then 'Credit card'
when 2 then 'Cash'
when 3 then 'No charge'
when 4 then 'Dispute'
when 5 then 'Unknown'
when 6 then 'Voided trip'
end as payment_type_description,
congestion_surcharge::double precision
from {{ source('staging','green_tripdata') }}
where vendorid is not null
Packages
Macros can be exported to?packages, similarly to how classes and functions can be exported to libraries in other languages. Packages contain standalone dbt projects with models and macros that tackle a specific problem area.
When you add a package to your project, the package's models and macros become part of your own project. A list of useful packages can be found in the?dbt package hub.
To use a package, you must first create a?packages.yml?file in the root of your work directory. Here's an example:
packages:
- package: dbt-labs/dbt_utils
version: 0.8.0
After declaring your packages, you need to install them by running the?dbt deps?command either locally or on dbt Cloud.
You may access macros inside a package in a similar way to how Python access class methods:
select
{{ dbt_utils.surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
cast(vendorid as integer) as vendorid,
-- ...
Variables
Like most other programming languages,?variables?can be defined and used across our project.
Variables can be defined in 2 different ways:
vars:
payment_type_values: [1, 2, 3, 4, 5, 6]
dbt build --m <your-model.sql> --var 'is_test_run: false'
Variables can be used with the?var()?macro. For example:
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}
Founder of DataTalks.Club | dstack ambassador
2 年Alvaro's notes are the best!
Practice Director, Data & Intelligence | Strategy & Execution | Executive Team Leadership | Go-to-Market | Offering Strategy & Development | Alliances & Partnerships
2 年Keep going Filipe!