DBT- Data Build Tool (Part I)

References

Alvaro Navas Notes

Data Engineering Zoomcamp Repository

What is dbt?

dbt stands for data build tool. It's a transformation tool that allows us to process raw data in our Data Warehouse into transformed data that can later be used by Business Intelligence tools and any other data consumers.

dbt also allows us to introduce good software engineering practices by defining a deployment workflow:

  1. Develop models
  2. Test and document models
  3. Deploy models with version control and CI/CD.

How does dbt work?

dbt works by defining a modeling layer that sits on top of our Data Warehouse. Each table is turned into a model and then transformed into a derived model that can be stored in the Data Warehouse for persistence.

A model consists of:

  1. A *.sql file
  2. A SELECT statement; no DDL or DML is used
  3. A file that dbt will compile and run in our Data Warehouse

How to use dbt?

dbt has 2 main components, dbt Core and dbt Cloud, with the following characteristics:

dbt Core: open-source project that handles the data transformation

  • Builds and runs a dbt project (.sql and .yaml files).
  • Includes SQL compilation logic, macros and database adapters.
  • Includes a CLI interface to run dbt commands locally.
  • Open-source and free to use.
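
For reference, the CLI mentioned above exposes a handful of core commands; here is a minimal sketch of the most common ones (run from the root of a dbt project):

dbt deps    # install packages declared in packages.yml
dbt seed    # load CSV seeds into the Data Warehouse
dbt run     # compile and execute the models
dbt test    # run the tests defined for models and sources
dbt build   # seed, run and test in dependency order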

dbt Cloud: SaaS application to develop and manage dbt projects

  • Web-based IDE to develop, run and test a dbt project.
  • Jobs orchestration.
  • Logging and alerting.
  • Integrated documentation.
  • Free for individuals (one developer seat).

How are we going to use dbt?

For this project, I'm integrating with GCP BigQuery, so I'll use the dbt Cloud IDE. No local installation of dbt is required.

Setting up dbt

You will need to create a dbt Cloud account using this link and connect it to your Data Warehouse following these instructions. More detailed instructions are available in this guide.

Developing with dbt

Anatomy of a dbt model

dbt models are a combination of SQL (using SELECT statements) and Jinja templating language to define templates.

Below is an example of a dbt model:

{{
    config(materialized='table')
}}

SELECT *
FROM staging.source_table
WHERE record_state = 'ACTIVE'        

  • In the Jinja statement defined within the {{ }} block we call the config function. More information about Jinja and how to use it for dbt in this link.
  • The config function is commonly used at the beginning of a model to define a materialization strategy: a strategy for persisting dbt models in a data warehouse.
  • There are 4 materialization strategies with the following characteristics:
  • View: when using the view materialization, the model is rebuilt as a view on each run, via a create view as statement.
  • Table: the model is rebuilt as a table on each run, via a create table as statement.
  • Incremental: allows dbt to insert or update records into a table since the last time dbt was run (see the sketch after this list).
  • Ephemeral: ephemeral models are not directly built into the database; instead, dbt will interpolate the code from the model into dependent models as a common table expression (CTE).
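
To make the incremental strategy more concrete, here is a hedged sketch (not from the original article; the record_loaded_at filter column is an assumption) of what such a model could look like:

{{ config(materialized='incremental') }}

select *
from {{ source('staging', 'green_tripdata') }}

{% if is_incremental() %}
-- is_incremental() and {{ this }} are built-in dbt helpers:
-- only pick up records newer than what is already in this model's table
where record_loaded_at > (select max(record_loaded_at) from {{ this }})
{% endif %}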

The FROM clause of a dbt model

The FROM clause within a SELECT statement defines the sources of the data to be used.

The following sources are available to dbt models:

Sources: The data loaded within our Data Warehouse.

  • We can access this data with the source() function.
  • The sources key in our YAML file contains the details of the databases that the source() function can access and translate into proper SQL-valid names.
  • Additionally, we can define "source freshness" for each source so that we can check whether a source is "fresh" or "stale", which can be useful to check whether our data pipelines are working properly.
  • More info about sources in this link.

Seeds: CSV files which can be stored in our repo under the seeds folder.

  • The repo gives us version control along with all of its benefits.
  • Seeds are best suited to static data which changes infrequently.
  • Seed usage:

  1. Add a CSV file to your seeds folder.
  2. Run the dbt seed command to create a table in our Data Warehouse.

  • If you update the content of a seed, running dbt seed will append the updated values to the table rather than substituting them. Running dbt seed --full-refresh instead will drop the old table and create a new one (see the command sketch below).

  3. Refer to the seed in your model with the ref() function.
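
A minimal command-line sketch of the seed workflow:

dbt seed                  # creates a table in the Data Warehouse from each CSV in the seeds folder
dbt seed --full-refresh   # drops and recreates the seed tables after a CSV changes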

Here's an example of how you would declare a source in a .yml file:

sources:
    - name: staging
      database: production
      schema: trips_data_all

      loaded_at_field: record_loaded_at
      tables:
        - name: green_tripdata
        - name: yellow_tripdata
          freshness:
            error_after: {count: 6, period: hour}        
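
Once freshness is configured as above, it can be checked from the command line (or from a dbt Cloud job); with recent dbt versions:

dbt source freshness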

And here's how you would reference a source in a FROM clause:

FROM {{ source('staging','yellow_tripdata') }}        

  • The first argument of the source() function is the source name, and the second is the table name.

In the case of seeds, assuming you've got a taxi_zone_lookup.csv file in your seeds folder which contains locationid, borough, zone and service_zone:

SELECT
    locationid,
    borough,
    zone,
    replace(service_zone, 'Boro', 'Green') as service_zone
FROM {{ ref('taxi_zone_lookup') }}

The ref() function references underlying tables and views in the Data Warehouse. When compiled, it will automatically build the dependencies and resolve the correct schema for us. So, if BigQuery contains a schema/dataset called dbt_dev inside the my_project database which we're using for development, and it contains a table called stg_green_tripdata, then the following code...

WITH green_data AS (
    SELECT *,
        'Green' AS service_type
    FROM {{ ref('stg_green_tripdata') }}
),        

...will compile to this:

WITH green_data AS (
    SELECT *,
        'Green' AS service_type
    FROM "my_project"."dbt_dev"."stg_green_tripdata"
),        

  • The ref() function translates our referenced table into the full reference, using the database.schema.table structure.
  • If we were to run this code in our production environment, dbt would automatically resolve the reference to make it point to our production schema.

Defining a source and creating a model

It's time to create our first model.

We will begin by creating 2 new folders under our models folder:

  • staging will have the raw models.
  • core will have the models that we will expose at the end to the BI tool, stakeholders, etc.

Under staging we will add 2 new files: stg_green_tripdata.sql and schema.yml:

# schema.yml

version: 2

sources:
    - name: staging
      database: your_project
      schema: trips_data_all

      tables:
          - name: green_tripdata
          - name: yellow_tripdata        

  • We define our sources in the schema.yml model properties file.
  • We are defining the 2 tables for yellow and green taxi data as our sources.

-- stg_green_tripdata.sql

{{ config(materialized='view') }}

select * from {{ source('staging', 'green_tripdata') }}
limit 100        

  • This query will create a view in the staging dataset/schema in our database.
  • We make use of the source() function to access the green taxi data table, which is defined inside the schema.yml file.

The advantage of having the properties in a separate file is that we can easily modify the schema.yml file to change the database details and write to different databases without having to modify our stg_green_tripdata.sql file.

You may now run the model with the dbt run command, either locally or from dbt Cloud.
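
For example, to build only this model (a minimal sketch, assuming the file is named stg_green_tripdata.sql):

dbt run --select stg_green_tripdata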

Macros

Macros are pieces of code in Jinja that can be reused, similar to functions in other languages.

dbt already includes a series of macros like config(), source() and ref(), but custom macros can also be defined.

Macros allow us to add features to SQL that aren't otherwise available, such as:

  • Use control structures such as if statements or for loops.
  • Use environment variables in our dbt project for production.
  • Operate on the results of one query to generate another query.
  • Abstract snippets of SQL into reusable macros.

Macros are defined in separate .sql files which are typically stored in a macros directory.

There are 3 kinds of Jinja delimiters:

  • {% ... %} for statements (control blocks, macro definitions)
  • {{ ... }} for expressions (literals, math, comparisons, logic, macro calls...)
  • {# ... #} for comments.

Here's a macro definition example:

{# This macro returns the description of the payment_type #}

{% macro get_payment_type_description(payment_type) %}

    case {{ payment_type }}
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        when 6 then 'Voided trip'
    end

{% endmacro %}        

  • The macro keyword states that the line is a macro definition. It includes the name of the macro as well as the parameters.
  • The code of the macro itself goes between 2 statement delimiters. The second statement delimiter contains an endmacro keyword.
  • In the code, we can access the macro parameters using expression delimiters.
  • The macro returns the code we've defined rather than a specific value.

Here's how we use the macro:

select
    {{ get_payment_type_description('payment_type') }} as payment_type_description,
    congestion_surcharge::double precision
from {{ source('staging','green_tripdata') }}
where vendorid is not null        

  • We pass the payment_type field, which contains an integer from 1 to 6.

And this is what it would compile to:

select
    case payment_type
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        when 6 then 'Voided trip'
    end as payment_type_description,
    congestion_surcharge::double precision
from {{ source('staging','green_tripdata') }}
where vendorid is not null        

  • The macro is replaced by the code contained within the macro definition as well as any variables that we may have passed to the macro parameters.

Packages

Macros can be exported to packages, similarly to how classes and functions can be exported to libraries in other languages. Packages contain standalone dbt projects with models and macros that tackle a specific problem area.

When you add a package to your project, the package's models and macros become part of your own project. A list of useful packages can be found in the dbt package hub.

To use a package, you must first create a packages.yml file in the root of your work directory. Here's an example:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0        

After declaring your packages, you need to install them by running the dbt deps command either locally or on dbt Cloud.

You may access macros inside a package in a similar way to how Python accesses class methods:

select
    {{ dbt_utils.surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
    cast(vendorid as integer) as vendorid,
    -- ...        

  • The surrogate_key() macro generates a hashed surrogate key from the fields passed as arguments.

Variables

As in most other programming languages, variables can be defined and used across our project.

Variables can be defined in 2 different ways:

  • Under the vars keyword inside dbt_project.yml.

vars:
    payment_type_values: [1, 2, 3, 4, 5, 6]        

  • As arguments when building or running your project.

dbt build --select <your_model> --vars '{is_test_run: false}'

Variables can be used with the var() macro. For example:

{% if var('is_test_run', default=true) %}

    limit 100

{% endif %}        

  • In this example, the default value for is_test_run is true; in the absence of a variable definition, either in the dbt_project.yml file or when running the project, is_test_run would be true.
  • Since we passed the value false when running dbt build, the if statement would evaluate to false and the code within would not run.
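
Tying this together with the payment_type_values list defined earlier, a variable can also drive a Jinja for loop. Here is a hedged sketch (not from the original article; it reuses the green taxi source and payment_type column from the macro example):

select
    {%- for p in var('payment_type_values', [1, 2, 3, 4, 5, 6]) %}
    count(case when payment_type = {{ p }} then 1 end) as payment_type_{{ p }}_trips
    {%- if not loop.last %},{% endif %}
    {%- endfor %}
from {{ source('staging', 'green_tripdata') }}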
