Introduction to dbt (Data Build Tool)
NIKHIL G R
dbt is an open-source command-line tool that enables data engineers and analysts to transform data in their warehouse more effectively. It allows users to define transformations as SQL select statements and manage them as version-controlled models.
dbt is the T in ELT. It doesn't extract or load data, but it's extremely good at transforming data that's already loaded into the warehouse.
dbt offers two products: dbt Core, a free open-source command-line tool, and dbt Cloud, a hosted service that builds on dbt Core.
Differences between dbt Core and dbt Cloud
dbt Cloud Pricing Plan
Team Plan
Read-only seats are included with the purchase of Developer seats: we get 5 free read-only seats even with the purchase of a single Developer seat on the Team plan. We can purchase up to 8 Developer seats on the Team plan, but the free read-only seats stay at 5.
As an example: if we purchase 1 Developer seat on the Team plan, we receive 5 free read-only seats. If we purchase 6 Developer seats, we still have 5 free read-only seats. If we purchase only 6 Developer seats on the Team plan, the maximum cost would be $600/month (at $100 per Developer seat per month).
Basic Terminologies in dbt
dbt Projects - A dbt project is a directory on the machine containing everything required to perform transformations on our data. It contains .sql files (called models) and YAML files (for configurations).
dbt Project Profiles - A project profile is a YAML file containing the connection details for our chosen data platform.
dbt Model - The atomic unit of transformation in dbt is called a model. Every dbt model defines all of its inputs using a special function called ref(). Models are primarily written as a select statement and saved as a .sql file. We can use CTEs (Common Table Expressions) and apply transformations using SQL (see the sketch after this list).
Source - Source tables refer to tables loaded into the warehouse by an Extract Load process.
Snapshots - A snapshot model in dbt is a special kind of model that tracks changes in the data over time.
Staging - The staging area is where raw data is cast into correct data types, given consistent column names, and prepared to be transformed into models used by end-users.
Marts - Marts consist of the core tables for end-users and business vertical-specific tables.
Core - The core defines the fact and dimension models to be used by end-users.
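To make these terms concrete, here is a minimal sketch of a source declaration and a staging model built on it; the names (raw_data, orders, stg_orders) are placeholders, not from a real project:

# models/staging/sources.yml
version: 2

sources:
  - name: raw_data          # schema populated by the Extract-Load process
    tables:
      - name: orders

-- models/staging/stg_orders.sql
-- cast raw columns to correct types and give them consistent names
with source as (
    select * from {{ source('raw_data', 'orders') }}
)

select
    cast(id as integer)      as order_id,
    cast(order_date as date) as order_date,
    status                   as order_status
from source

A downstream mart model would then reference it with ref('stg_orders'). A snapshot lives in its own file; this sketch uses dbt's built-in timestamp strategy, again with placeholder names:

-- snapshots/orders_snapshot.sql
{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('raw_data', 'orders') }}

{% endsnapshot %}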
How are data models created?
Every dbt model defines all of its inputs using a special Jinja function called ref().
dbt can select different parts of the DAG to execute.
Materializations define how to turn a model into a warehouse object.
dbt has several built-in materializations: table, view, incremental, and ephemeral. For example, a model configured as:
{{ config(materialized='table') }}

select ... from table_name

is rendered to:
CREATE TABLE ... AS ( select ... from table_name);
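For comparison (this variant is not in the original example), the same model configured with materialized='view' would be rendered to a CREATE VIEW statement instead:

{{ config(materialized='view') }}

select ... from table_name

is rendered to:

CREATE VIEW ... AS (select ... from table_name);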
# run all the models
$ dbt run
# run one specific model (model_name is a placeholder)
$ dbt run --models model_name

# run a model and those that depend on it
$ dbt run --models model_name+
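The + operator also works as a prefix, selecting a model's upstream dependencies:

# run a model and everything upstream that it depends on
$ dbt run --models +model_name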
Testing Framework in dbt
dbt provides a powerful framework for testing data models. It allows us to define tests on our models to ensure data quality and integrity.
Inbuilt tests
We can test uniqueness, not-null, and foreign-key constraints, among others.
version: 2

models:
  - name: employee
    description: This table contains the employee data loaded
    columns:
      - name: id
        description: The unique key of the employee table
        tests:
          - not_null
          - unique
      - name: emp_name
        description: The name of an employee
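These schema tests are executed with the dbt test command:

# run all tests in the project
$ dbt test

# run tests for the employee model only
$ dbt test --models employee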
Custom Tests
We can even define our own custom logic to test the models. A custom generic test is written as a select statement that returns the failing rows; the test passes when no rows are returned.
-- marks_between.sql
{% test marks_between(model, column_name) %}

with validation as (

    select
        {{ column_name }} as marks
    from {{ model }}

),

validation_errors as (

    select
        marks
    from validation
    where marks < 0 or marks > 100

)

select *
from validation_errors

{% endtest %}
# student.yml
version: 2

models:
  - name: student
    description: This table contains the student data loaded
    columns:
      - name: student_id
        description: The unique key of student data
        tests:
          - not_null
      - name: marks
        description: The value of the marks
        tests:
          - marks_between
Debugging and Version Control
dbt provides built-in logging functionality that captures various events and execution statuses during its operation.
It also supports Git-based version control.
Auto Documentation and Lineage
dbt generates documentation for the data models from the details provided in the YAML files.
To generate documentation and serve them, run the following commands:
$ dbt docs generate
$ dbt docs serve
Data lineage in dbt is represented through the DAG, which shows how data is transformed from raw sources to the final models used for analysis.
This lineage graph helps teams understand the dependencies and relationships between different data models, making it easier to identify potential issues and inefficiencies.
Incremental Loading in dbt
dbt's incremental loading mechanism relies on comparing source and target data to determine which records need to be updated or inserted. For large datasets or complex transformations, this comparison process can be resource-intensive and may impact performance, especially when dealing with tables with millions of records.
While dbt provides a simple mechanism for implementing basic incremental loading based on date or timestamp columns, it may lack support for more complex incremental logic, such as delta or CDC (Change Data Capture) mechanisms.
{{
    config(
        materialized='incremental'
    )
}}

select
    *,
    my_slow_function(my_column)
from {{ ref('app_data_events') }}

{% if is_incremental() %}
    -- on incremental runs, only process events newer than the latest one already loaded
    where event_time >= (select coalesce(max(event_time), '1900-01-01') from {{ this }})
{% endif %}
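By default, an incremental run only appends the rows selected by the filter. If existing rows can change, dbt also accepts a unique_key in the config; this is a minimal sketch, where event_id is a hypothetical key column:

{{
    config(
        materialized='incremental',
        unique_key='event_id'
    )
}}

With a unique_key set, incoming rows whose key already exists in the target are updated (via a merge or delete+insert, depending on the adapter) rather than duplicated.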
Benefits of Using Databricks with dbt
Advantages and Disadvantages of using dbt
Advantages
Disadvantages
Using dbt for Stream Processing
We can't do stream processing with dbt alone; we need adapters alongside dbt to achieve it.
dbt is primarily designed for batch-oriented data processing and modeling rather than stream processing. While dbt is not specifically tailored for stream processing, it can still be used in conjunction with stream processing frameworks or tools to handle certain aspects of data transformation, modeling, and analytics.
Steps for using dbt with Azure Databricks
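The detailed walkthrough is beyond this overview, but as a rough sketch: connecting dbt Core to Azure Databricks means installing the dbt-databricks adapter and pointing a profile at a Databricks SQL warehouse. All names and values below are placeholders:

# install the adapter
$ pip install dbt-databricks

# profiles.yml (placeholder values)
my_project:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: main
      schema: analytics
      host: adb-1234567890123456.7.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123
      token: "{{ env_var('DATABRICKS_TOKEN') }}"

Once the profile is in place, dbt debug verifies the connection and dbt run executes the models against Databricks.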