Data Build Tool (dbt) — Aamir P

dbt is a command-line tool that lets you transform and model data in a data warehouse using SQL. It is mainly used for the transformation and analytics steps of data pipelines.

dbt transforms raw data into structured, analysis-ready tables. Models are written as .sql files, and your dbt directory organises your SQL models, configurations, and tests. A profile is a configuration file that connects dbt to the data warehouse. Seeds are files (CSVs) containing raw reference data; they behave like source tables. Snapshots are a way to create time-based snapshots of your data models; they are useful for maintaining historical data in your warehouse.

To start with, install dbt. Recent releases are distributed as dbt-core plus a warehouse adapter (installing an adapter such as dbt-snowflake, shown later, pulls in dbt-core automatically):

pip install dbt-core

Now to create a dbt project

dbt init my_project

Inside your project directory, you’ll find a dbt_project.yml file. This is where you configure project settings such as the project name, model paths, and materialisations. The database connection details themselves live in a separate profiles.yml file (by default inside the .dbt folder in your home directory).
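As a rough sketch, a Snowflake entry in profiles.yml might look like this (all values below are placeholders):

my_profile:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_account_identifier
      user: your_username
      password: your_password
      role: your_role
      database: analytics
      warehouse: transforming
      schema: dbt_dev
      threads: 4

The profile name (my_profile here) must match the profile key in dbt_project.yml.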

Models are created as files with the .sql extension.
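For example, a model is just a SELECT statement saved as a .sql file. A minimal sketch (the stg_orders file name and the raw.orders source are made up for illustration):

-- models/stg_orders.sql (hypothetical file and source names)
with source as (
    select * from {{ source('raw', 'orders') }}
)

select
    id as order_id,
    customer_id,
    order_date
from source

When you run dbt, this file is compiled and built in the warehouse as a view or table, depending on its materialisation.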

To run dbt

cd my_project

dbt run

To validate the quality of your data

dbt test
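dbt test runs the tests declared for your models. Tests are most commonly declared in a YAML file next to the models; a minimal sketch, assuming a model named stg_orders with order_id and status columns:

# models/schema.yml (hypothetical model and column names)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']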

You can document your models in YAML files. SQL macros let you reuse dbt code. dbt makes collaboration possible: data engineers, data analysts, and data scientists can work on the same project. dbt supports different materialisation strategies, including views, tables, and incremental models. You can deploy your dbt project using CI/CD pipelines (continuous integration and continuous deployment, used for automated testing and deployment) or tools like dbt Cloud. Snapshotting lets you maintain historical data in your data warehouse. The dbt CLI provides commands for tasks like generating documentation, cleaning, and compiling SQL. Transformation here means filtering, aggregating, and joining data.

So, that is the overview of dbt. Now let’s dive in deeper.

Data Maturity Model

Just like Maslow’s hierarchy of needs, data has its own pyramid: the Data Maturity Model, which starts with Data Collection, then Data Wrangling, Data Integration, BI and Analytics, and lastly AI.

Once we collect the data, we clean it; this is the step that transforms the data from its raw form. The cleaned data is then moved to the data lake or the reporting layer.

Data extraction means pulling data from various sources, such as sensors, Google Analytics, etc. The collected data is stored in a staging area, where transformation takes place: removing duplicates, fixing naming mistakes, and so on. Data integration is all about writing the transformed data from the staging area to the target.

ETL and ELT

Snowflake, Redshift, etc. are extremely scalable, so transformations can happen inside the warehouse after loading; ELT works well here, and it is the pattern dbt follows. In other cases, where the target cannot handle the transformation workload, we opt for ETL.

Data Warehouse

The purpose of a data warehouse is analytics and reporting. It handles structured data and, in modern cloud platforms, semi-structured data as well. Data warehouses come in two streams, namely on-premise and cloud. Examples of on-premise vendors are IBM, Oracle, etc.; examples of cloud warehouses are Snowflake, Redshift, etc.

Data Lake

A data lake is used to store large files outside the database. For example, instead of loading huge raw files into a Snowflake table, we can keep them in an Amazon S3 bucket, i.e. an external location, and use that link for querying.

Lakehouse

A lakehouse offers features similar to a data warehouse, like data structure and data management, while sitting on top of cloud storage.

SCD(Slowly Changing Dimensions)

Companies that need to keep historical data use SCDs.

SCD Type 0: The DWH table is not updated when the dimension changes.

SCD Type 1: The DWH table is updated when a dimension changes; the original data is overwritten.

SCD Type 2: A new row is added for each change; the original rows are preserved.

SCD Type 3: A new column is added to hold the previous value.

SCD Type 4: Historical data is stored in a separate table.

SCD Type 6: Combines SCD Types 1, 2 and 3 to maintain historical data.

Prerequisites

To set up dbt, ensure you install Python 3.11, as dbt supports versions 3.7 to 3.11, but not 3.12. Install Git to avoid an error during dbt setup, though it’s not critical if you can’t. When setting up your dbt project, manually create a .dbt folder on Windows or Mac/Linux to prevent project creation issues. For Windows, use mkdir %userprofile%\.dbt in cmd; for Mac/Linux, use mkdir ~/.dbt in terminal. Avoid copy-pasting your Snowflake password during setup, as it may cause issues. If using WSL on Windows, follow WSL-specific instructions. Mac users should install Python 3.11 via Homebrew. Linux users can follow the Mac instructions. While dbt Cloud setups are possible, support is limited, and some features may not work fully.

Virtual Environment

dir lists the contents of a directory.

cd moves you into the right directory.

pip install virtualenv installs the virtual environment package.

venv\Scripts\activate activates the environment.
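Putting it together on Windows, the sequence might look like this (the environment name venv is arbitrary):

pip install virtualenv
virtualenv venv
venv\Scripts\activate

On Mac/Linux, activate with source venv/bin/activate instead.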

DBT Installation

python --version shows the version of Python you are using.

pip install dbt-snowflake==1.7.1

dbt-core is the core framework; adapter packages such as dbt-snowflake act as connectors to a specific warehouse.

Commands

build — Run all seeds, models, snapshots and tests.

clean — Delete the folders listed in clean-targets (e.g. target/ and dbt_packages/).

clone — Clone selected models from a specified state (using zero-copy cloning where the warehouse supports it).

compile — Generate executable SQL from the models.

debug — Check that the installation, project and database connection are set up properly.

deps — Install the dbt packages specified in packages.yml.

docs — Generate or serve the documentation website.

init — Create a new dbt project.

list — List the resources in your project.

parse — Parse the project and provide information on performance.

retry — Retry the nodes that failed in the previous run.

run — Compile and run the SQL models.

run-operation — Run the named macro.

seed — Load data from CSV files into the warehouse.

show — Preview the results of a model or inline query.

snapshot — Execute snapshots.

source — Run source-related subcommands, e.g. dbt source freshness.

test — Run tests on the data in deployed models.

If you encounter Snowflake connection issues while running dbt debug, ensure you're not behind a corporate firewall or proxy, connected to a VPN, or using a network with a self-signed certificate. Check that your laptop has full internet access. If problems persist, consider using a different laptop or Wi-Fi. If issues with Python or the virtual environment remain unresolved, you can use GitHub Codespaces for a pre-configured environment with full Snowflake connectivity.

When you run dbt init, it will ask for the account, username, password, etc. as credentials to configure the connection to Snowflake (these are saved to profiles.yml).

Overview of DBT

The dbt_project.yml file is a configuration file that defines key settings, including the project name, version, model configurations, and resource paths. An example:

# Name of your dbt project
name: my_dbt_project

# Version of your project
version: 1.0.0

# The profile (connection) to use, defined in profiles.yml
profile: my_profile

# Directory where dbt will look for models
model-paths: ["models"]

# Directory for compiled SQL files
target-path: "target"

# Directories for documentation files
docs-paths: ["docs"]

# Directory for macros
macro-paths: ["macros"]

# Directory for seed files
seed-paths: ["data"]

# Directory for analysis files
analysis-paths: ["analysis"]

# Default configuration (schema and materialisation) for models in this project
models:
  my_dbt_project:
    +schema: my_schema
    +materialized: view

# Optional: tests, snapshots, and other resources can be configured similarly

A new dbt project also contains the following directories and files:

  1. analyses: Directory where you store .sql files for ad-hoc analysis or exploration. These analyses do not create models in your database but are used to query and investigate data.

  2. tests: Contains custom tests for your dbt models. These can be schema tests (e.g., uniqueness, null checks) or custom SQL-based tests. Tests ensure data quality and integrity.

  3. logs: Directory where dbt stores logs from runs such as dbt run, dbt test, etc. These logs are useful for debugging and understanding the behaviour of your dbt commands.

  4. macros: Contains reusable SQL snippets or functions that you can call in your models, tests, or other macros. They help to DRY (Don’t Repeat Yourself) out your SQL logic.

  5. models: The core directory where your dbt models reside. Models are .sql files that define transformations on your raw data and create new tables or views in your database.

  6. seeds: Contains CSV files that dbt loads into your database as tables. Seeds are often used to load static data or reference tables that are useful for your models.

  7. snapshots: Directory for snapshot files, which capture and store the state of your data at specific points in time. Snapshots are useful for slowly changing dimensions and auditing.

  8. .gitignore: A file that specifies which files and directories Git should ignore in version control. It usually includes entries like logs/, target/, and other generated files or directories that do not need to be tracked.

CTE

CTE stands for Common Table Expression: a named, temporary result set defined with the WITH keyword. CTEs are used to simplify complex SQL queries and improve readability and maintainability.

  1. Although CTEs are conceptually similar to subqueries, they can sometimes help with query optimization by allowing the database engine to better understand the query structure. However, it’s essential to check the execution plan and performance implications as this can vary based on the database system.
  2. CTEs allow you to test and debug each part of the SQL query independently. You can validate individual CTEs to ensure that they produce the expected results before integrating them into more complex queries.
  3. By defining parts of your query as CTEs, you modularize your SQL code. This modular approach can be especially helpful in dbt models, where you want to keep transformations logical and organized.
  4. In dbt models, CTEs are often used to perform intermediate calculations or transformations before producing the final result. They help maintain clean and efficient SQL code within dbt’s transformation models.
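For example, a dbt model built from CTEs might look like this sketch (model and column names are hypothetical):

-- models/customer_orders.sql (hypothetical names)
with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
),

final as (
    select
        customers.customer_id,
        count(orders.order_id) as order_count
    from customers
    left join orders
        on orders.customer_id = customers.customer_id
    group by customers.customer_id
)

select * from final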

Materialisations

  1. View — a lightweight representation; use it when you don’t read from the model repeatedly.
  2. Table — use it when you read from the model repeatedly.
  3. Incremental — appends new rows to an existing table; well suited to fact tables (see the config sketch after this list).
  4. Ephemeral (CTEs) — not built in the warehouse; it is interpolated as a CTE, i.e. an alias to your data.
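A materialisation is set per model (or per folder in dbt_project.yml) with a config() block. A sketch of an incremental model, with hypothetical table and column names:

-- models/fct_events.sql (hypothetical names)
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    event_type,
    created_at
from {{ source('raw', 'events') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already loaded
  where created_at > (select max(created_at) from {{ this }})
{% endif %}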

Seeds and Sources (dbt Source Freshness)

  1. Seeds are local files that you upload to the data warehouse from dbt.
  2. Sources are an abstraction layer on top of your input tables.
  3. Source freshness can be checked automatically (see the sketch after this list).
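Sources and their freshness rules are declared in YAML; a sketch, assuming a raw.orders table with a _loaded_at timestamp column:

# models/sources.yml (hypothetical names)
version: 2

sources:
  - name: raw
    database: raw_db
    schema: public
    tables:
      - name: orders
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

dbt source freshness then warns or errors if the newest _loaded_at value is older than these thresholds, and models refer to the table with {{ source('raw', 'orders') }}.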

Snapshots

  1. Found in the snapshots folder.
  2. Timestamp strategy: A unique key and an updated_at field are defined in the source model. These columns are used to determine changes.
  3. Check strategy: Any change in a set of columns or all columns will be picked up as an update.
  4. Snapshots produce Type 2 SCD tables (see the sketch after this list).
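A snapshot using the timestamp strategy might look like this sketch (source and column names are hypothetical):

-- snapshots/orders_snapshot.sql (hypothetical names)
{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('raw', 'orders') }}

{% endsnapshot %}

dbt snapshot adds dbt_valid_from and dbt_valid_to columns and writes a new row whenever a tracked record changes.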

Tests

Singular and Generic are the two types.

There are 4 built-in generic tests:

  1. unique
  2. not_null
  3. accepted_values
  4. relationships

You can generate your own custom generic tests or import tests from dbt packages.

Singular tests are SQL queries, stored as SQL files in the tests directory, that are expected to return an empty result set; any rows returned are treated as failures.
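For example, a singular test might look like this sketch (model and column names are hypothetical); the test fails if the query returns any rows:

-- tests/assert_no_negative_order_totals.sql (hypothetical names)
select *
from {{ ref('stg_orders') }}
where order_total < 0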

Macros

Macros are Jinja templates created in the macros folder. dbt ships with many built-in macros, and you can use macros in model definitions and tests. A special kind of macro, declared with the test block, is how custom generic tests are implemented. dbt packages can be installed easily to get access to extra macros and tests. In short, macros are reusable SQL, and they are also the mechanism behind custom tests.
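A small sketch of a custom macro and how it is called from a model (the macro name and columns are hypothetical):

-- macros/cents_to_dollars.sql (hypothetical macro)
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- used inside a model:
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as amount
from {{ ref('stg_payments') }}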

Documentation

Two types of documentation can be written: YAML files (e.g. schema.yml) and standalone markdown files. dbt ships with a lightweight documentation web server. To customise the landing page, a special file, overview.md, is used. You can add images and other assets to a dedicated assets folder.

A post-hook, defined in dbt_project.yml or in a model’s config block, is executed after each associated model runs.
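A post-hook sketch in dbt_project.yml; the grant statement and role name are just an example:

# dbt_project.yml
models:
  my_dbt_project:
    +post-hook: "grant select on {{ this }} to role reporter"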

Great Expectations

Great Expectations is a popular data quality framework, and integrating it into dbt allows you to add data validation and testing directly into your data pipelines. It provides checks such as looking for null values and duplicates, or ensuring data is within specified ranges. These checks (expectations) can be added to dbt models as part of your transformation workflow.

  • Set up Great Expectations within DBT using a configuration that ties it into DBT’s transformation tasks.
  • Define your expectations within the DBT models, typically through YAML configurations or within SQL queries.
  • When DBT runs a transformation, the Great Expectations tests automatically validate the data, flagging issues for debugging and enhancing the reliability of your data pipeline.
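In practice, many teams get Great Expectations-style checks in dbt through the community dbt-expectations package, installed via packages.yml. A sketch, assuming that package is installed and using hypothetical model and column names:

# models/schema.yml — assumes the dbt-expectations package is listed in packages.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_total
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000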

Benefits:

  • Proactive Data Quality: Issues are caught early in the pipeline.
  • Transparency: Reports show data quality issues, helping with audits.
  • Automation: Data validation is built into the pipeline, reducing manual checks.

Commands for writing log messages to the dbt.log file are {{ log("Message") }} or {{ log("Message", info=True) }}; the info=True variant also prints the message on the screen. To disable a log line temporarily, turn the {{ log(...) }} line into a Jinja comment: {# log(...) #}. dbt variables are referenced with var("variable_name"), and you can check whether a variable exists with {% if var("variable_name", False) %}. You can use the generate button (or dbt docs generate) to generate documentation.
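A sketch of these pieces in a macro you could call with dbt run-operation (the macro and variable names are hypothetical):

-- macros/log_run_info.sql (hypothetical names)
{% macro log_run_info() %}
    {{ log("Run started for target: " ~ target.name, info=True) }}
    {% if var("run_date", False) %}
        {{ log("run_date is set to " ~ var("run_date"), info=True) }}
    {% else %}
        {{ log("run_date is not set", info=True) }}
    {% endif %}
{% endmacro %}

Run it with dbt run-operation log_run_info, optionally passing the variable with --vars '{run_date: 2024-01-01}'.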

I’m not an expert in dbt, I’m just learning. Please feel free to point out any corrections!

Check out this link to know more about me

Let’s get to know each other! https://lnkd.in/gdBxZC5j

Get my books, podcasts, placement preparation, etc. https://linktr.ee/aamirp

Get my Podcasts on Spotify https://lnkd.in/gG7km8G5

Catch me on Medium https://lnkd.in/gi-mAPxH

Follow me on Instagram https://lnkd.in/gkf3KPDQ

Udemy (Python Course) https://lnkd.in/grkbfz_N

YouTube https://www.youtube.com/@knowledge_engine_from_AamirP

Subscribe to my Channel for more useful content.
