Data Build Tool (DBT) — Aamir P
dbt is a command-line tool that lets you transform and model data in your data warehouse using SQL queries. It is mainly used for the transformation and analytics steps of data pipelines.
dbt transforms raw data into structured, analysis-ready tables. Models live in .sql files, and your dbt directory organises your SQL models, configurations, and tests. A profile is a configuration file that holds the connection to the data warehouse. Seeds are CSV files of raw data that dbt loads as tables; they behave like source data. Snapshots are a way to create time-based snapshots of your data models, which is useful for maintaining historical data in your warehouse.
To start, install dbt. For recent versions the old pip install dbt shortcut has been replaced by dbt-core plus a warehouse adapter (the Snowflake adapter is installed later in this article):
pip install dbt-core
Now to create a dbt project
dbt init my_project
Inside your project directory, you’ll find a dbt_project.yml file. This is where you configure project settings such as the project name, folder paths, and model materializations; the database connection details themselves live in profiles.yml.
Models are created as .sql files inside the models directory.
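For example, a minimal model file might look like this (the file name and the source table are just placeholders):
-- models/stg_customers.sql (hypothetical file): rename columns from a raw table
select
    id as customer_id,
    first_name,
    last_name
from my_database.raw.customers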
To run dbt
cd my_project
dbt run
To validate the quality of your data
dbt test
You can document your models in YAML files. SQL macros let you reuse dbt code. Collaboration is built in: data engineers, data analysts, and data scientists can work together on the same project. dbt supports different materialization strategies, including views, tables, and incremental models. You can deploy your dbt project using CI/CD pipelines or tools like dbt Cloud; CI/CD stands for continuous integration and continuous deployment and gives you automated testing and deployment. Snapshotting lets you maintain historical data in your data warehouse. dbt provides various CLI commands for tasks like generating documentation, cleaning, and linting SQL files. Transformation here means filtering, aggregating, and joining data.
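As a quick sketch of how a materialization is chosen, a model can declare it in a config block at the top of its .sql file (the model and ref names below are made up):
-- models/orders_summary.sql (hypothetical): build this model as a table instead of a view
{{ config(materialized='table') }}

select
    customer_id,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by customer_id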
So, that is the overview of dbt. Now let’s dive in deeper.
Data Maturity Model
Just like Maslow’s pyramid of needs, data has a maturity model: a hierarchy that starts with Data Collection at the base, then Data Wrangling, Data Integration, BI and Analytics, and finally AI at the top.
Once we collect the data, we clean it; this is the step that transforms it from its raw form. The cleaned data is then moved to the data lake or the reporting layer.
Data extraction means pulling data from various sources, such as sensors, Google Analytics, etc. The collected data is stored in a staging area, where transformation takes place: removing duplicates, fixing naming mistakes, and so on. Data integration is all about writing the transformed data from the staging area to the target.
ETL and ELT
Snowflake, Redshift, etc. are extremely scalable, so transformations can happen inside the warehouse after the raw data is loaded; this is the ELT pattern, and it is what dbt is built for. When the target system cannot handle heavy transformations, the data is transformed before loading, which is the traditional ETL approach.
Data Warehouse
The purpose of a data warehouse is analytics and reporting. Modern warehouses can handle semi-structured data as well. Data warehouses come in two streams: on-premise and cloud. On-premise examples are IBM and Oracle; cloud examples are Snowflake, Redshift, etc.
Data Lake
This is used to store large files outside the database. For example, rather than loading huge raw files into a Snowflake table, we can keep them in an Amazon S3 bucket, register that location as an external link (an external stage or table in Snowflake terms), and use the link for querying purposes.
Lakehouse
A lakehouse offers features similar to a data warehouse, such as data structure and data management, but it sits directly on top of cheap cloud storage (the data lake).
SCD(Slowly Changing Dimensions)
Companies that need to store historical data use SCD (Slowly Changing Dimensions).
SCD Type 0: The DWH table is not updated when the source dimension changes; the original value is kept.
SCD Type 1: The DWH table is updated when a dimension changes, overwriting the original data; no history is kept.
SCD Type 2: A new row is added for every change, so the full history is preserved (this is what dbt snapshots implement; see the sketch after this list).
SCD Type 3: A new column is added to hold the previous value.
SCD Type 4: Historical data is stored in a separate history table.
SCD Type 6: Combines Types 1, 2 and 3 to maintain historical data.
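dbt snapshots are the usual way to get SCD Type 2 behaviour. A minimal sketch, assuming a source called crm.customers with an updated_at column:
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- dbt adds dbt_valid_from / dbt_valid_to columns and inserts a new row on every change
select * from {{ source('crm', 'customers') }}

{% endsnapshot %}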
Prerequisites
To set up dbt, ensure you install Python 3.11, as dbt supports versions 3.7 to 3.11, but not 3.12. Install Git to avoid an error during dbt setup, though it’s not critical if you can’t. When setting up your dbt project, manually create a .dbt folder on Windows or Mac/Linux to prevent project creation issues. For Windows, use mkdir %userprofile%\.dbt in cmd; for Mac/Linux, use mkdir ~/.dbt in terminal. Avoid copy-pasting your Snowflake password during setup, as it may cause issues. If using WSL on Windows, follow WSL-specific instructions. Mac users should install Python 3.11 via Homebrew. Linux users can follow the Mac instructions. While dbt Cloud setups are possible, support is limited, and some features may not work fully.
Virtual Environment
dir lists the contents of the current directory (on Windows).
cd moves you into the right folder, i.e. your project directory.
pip install virtualenv installs the virtualenv package.
virtualenv venv creates the environment, and venv\Scripts\activate activates it on Windows; the full sequence is sketched below.
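Putting the whole sequence together on Windows (the folder path is just an example):
REM move into your working folder
cd C:\projects\dbt_course
REM install virtualenv, then create an environment named venv
pip install virtualenv
virtualenv venv
REM activate it (Mac/Linux equivalent: source venv/bin/activate)
venv\Scripts\activate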
DBT Installation
python --version shows the version of Python we are using.
pip install dbt-snowflake==1.7.1
dbt-core is the core framework; adapters such as dbt-snowflake act as the connectors to a specific warehouse, and installing an adapter pulls in dbt-core automatically.
Commands
build — Run all seeds, models, snapshots and tests.
clean — Delete the folders listed in clean-targets (typically target and dbt_packages)
clone — Clone selected models from a previous run’s state into the current target
compile — Generates executable SQL
debug — Check that your project configuration and database connection are working
deps — Install the dbt packages specified in packages.yml
docs — Generate or serve the documentation website
init — New DBT project
list — List the resources in your project
parse — Parses the project and provides information on performance
retry — Retry the nodes that failed in the previous run
run — Compile and run the SQL models
run-operation — Run the named macro
seed — Load data from CSV files into the warehouse
show — Preview the results of a model or an inline query
snapshot — Execute Snapshots
source — Work with source tables, e.g. dbt source freshness to check how up to date they are
test — Run tests on data in deployed models
If you encounter Snowflake connection issues while running dbt debug, ensure you're not behind a corporate firewall or proxy, connected to a VPN, or using a network with a self-signed certificate. Check that your laptop has full internet access. If problems persist, consider using a different laptop or Wi-Fi. If issues with Python or the virtual environment remain unresolved, you can use GitHub Codespaces for a pre-configured environment with full Snowflake connectivity.
When you run dbt init, it asks for credentials such as account, username, password, role, database, warehouse, and schema to configure the connection to Snowflake, and stores them in profiles.yml inside your .dbt folder.
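Those answers end up in ~/.dbt/profiles.yml. A sketch of that file for Snowflake, where every value is a placeholder and the top-level key must match the profile name in dbt_project.yml:
my_profile:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: abc12345.eu-west-1   # your Snowflake account identifier
      user: my_user
      password: my_password
      role: transform
      database: analytics
      warehouse: compute_wh
      schema: dbt_schema
      threads: 4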
Overview of DBT
The dbt_project.yml file is a configuration file that defines key settings, including the project name, version, model configurations, and resource paths. An example:
# Name of your dbt project
name: my_dbt_project

# Version of your project
version: "1.0.0"

# Profile from profiles.yml that holds the connection settings
profile: my_profile

# Directory where dbt will look for models
model-paths: ["models"]

# Directory for compiled SQL files
target-path: "target"

# Directory for documentation files
docs-paths: ["docs"]

# Directory for macros
macro-paths: ["macros"]

# Directory for seed files
seed-paths: ["seeds"]

# Directory for analysis files
analysis-paths: ["analysis"]

# Default schema and materialisation for the models in this project
models:
  my_dbt_project:
    +schema: my_schema
    +materialized: view
# Optional: Tests, snapshots, and other resources can be configured similarly
2. tests: singular test SQL files live here.
3. logs: dbt writes its log output (dbt.log) here.
4. macros: reusable Jinja macros for models and tests.
5. models: the SQL models themselves, usually organised into subfolders.
6. seeds: CSV files that dbt seed loads into the warehouse.
7. snapshots: snapshot definitions for capturing history (SCD Type 2).
8. .gitignore: files and folders (such as target/ and logs/) excluded from version control.
CTE
CTE stands for Common Table Expression. A CTE is a named, temporary result set defined with the WITH keyword that exists only for the duration of a single query. CTEs are used to simplify complex SQL queries and improve readability and maintainability, and dbt models are conventionally written as a chain of CTEs.
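For example, a dbt model written as a chain of CTEs (the ref names are placeholders):
-- models/fct_orders.sql (hypothetical): import, join, and expose via CTEs
with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
),

final as (
    select
        orders.order_id,
        orders.order_date,
        customers.customer_id,
        customers.first_name
    from orders
    join customers on orders.customer_id = customers.customer_id
)

select * from final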
Materialisations
Seeds and Sources(dbt Source Freshness)
Snapshots
Tests
Tests come in two types: singular and generic.
There are 4 built-in generic tests: unique, not_null, accepted_values, and relationships.
You can generate your own custom generic tests or import tests from dbt packages.
Singular tests are SQL queries that are expected to return an empty result set; any rows they return are reported as failures. You define them as SQL files in the tests directory.
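A sketch of both kinds, assuming a model called stg_orders with order_id, status, and amount columns. Generic tests are attached in YAML:
models/schema.yml:
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
A singular test is just a query in the tests folder that should return no rows:
tests/assert_no_negative_amounts.sql:
-- any rows returned here are reported as test failures
select *
from {{ ref('stg_orders') }}
where amount < 0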
Macros
These are Jinja templates created in the macros folder, and dbt ships with many built-in macros. You can use macros in model definitions and in tests. A special kind of macro, declared with a test block, is used to implement your own generic tests. dbt packages can be installed easily to get access to extra macros and tests. In short, macros are reusable SQL and are often used for custom tests.
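A minimal sketch of both uses, with made-up names: a plain macro that converts cents to dollars, and a custom generic test declared with a test block.
macros/cents_to_dollars.sql:
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100)::numeric(16, 2)
{% endmacro %}
macros/test_is_positive.sql:
{% test is_positive(model, column_name) %}
-- fails for any rows where the column is negative
select *
from {{ model }}
where {{ column_name }} < 0
{% endtest %}
In a model you would then write {{ cents_to_dollars('amount_cents') }} as amount, and in schema.yml you could attach is_positive to a column just like a built-in generic test.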
Documentation
Documentation can be written in two places: YAML files (e.g. schema.yml) and standalone markdown files. dbt ships with a lightweight documentation web server. To customise the landing page, a special file called overview.md is used, and you can add images and other assets to a dedicated assets folder.
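For example, a description in YAML can point at a doc block written in markdown (all names here are placeholders):
models/schema.yml:
version: 2
models:
  - name: stg_customers
    description: '{{ doc("stg_customers_doc") }}'
    columns:
      - name: customer_id
        description: Primary key of the customer.
models/docs.md:
{% docs stg_customers_doc %}
One row per customer, cleaned up from the raw source data.
{% enddocs %}
Run dbt docs generate and then dbt docs serve to build and browse the site.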
A post-hook, configured in dbt_project.yml (or in a model’s config block), is executed after every run of the associated model.
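A sketch of a post-hook in dbt_project.yml, here granting select on every model after it is built (the role name is an assumption):
models:
  my_dbt_project:
    # runs after each model in the project finishes building
    +post-hook:
      - "grant select on {{ this }} to role reporter"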
Great Expectations
Great Expectations is a popular data quality framework, and integrating it with dbt lets you add data validation and testing directly into your data pipelines. It provides checks such as looking for null values or duplicates, or ensuring values fall within specified ranges. These checks (expectations) can be added to dbt models as tests in your transformation workflow.
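In practice this is usually done through the dbt-expectations package, which ports many Great Expectations checks to dbt tests. A sketch, assuming the same stg_orders model as before (the package name and version range are illustrative; check the dbt package hub for current values):
packages.yml:
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
models/schema.yml:
version: 2
models:
  - name: stg_orders
    columns:
      - name: amount
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 10000
Run dbt deps to install the package and dbt test to execute the expectation.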
Benefits: data-quality issues are caught early, inside the same pipeline that builds the models, rather than downstream in reports.
To write log messages to the dbt.log file, use {{ log("Message") }} or {{ log("Message", info=True) }}; the info=True variant also prints the message on the screen. To disable a log call temporarily, turn the {{ log( ... ) }} line into a Jinja comment: {# log( ... ) #}. dbt variables are referenced with var("variable_name"), and you can check whether a variable exists with {% if var("variable_name", False) %}. You can use the generate button to generate the documentation.
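A small sketch putting log and var together inside a model or macro (the variable name is made up):
{% if var("start_date", False) %}
    {{ log("Building from " ~ var("start_date"), info=True) }}
{% else %}
    {{ log("No start_date given, building full history", info=True) }}
{% endif %}
You would pass the variable on the command line with dbt run --vars '{start_date: "2024-01-01"}'.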
I’m not an expert in dbt, I’m just learning. Please feel free to point out any corrections!
Check out this link to know more about me
Let’s get to know each other! https://lnkd.in/gdBxZC5j
Get my books, podcasts, placement preparation, etc. https://linktr.ee/aamirp
Get my Podcasts on Spotify https://lnkd.in/gG7km8G5
Catch me on Medium https://lnkd.in/gi-mAPxH
Follow me on Instagram https://lnkd.in/gkf3KPDQ
Udemy (Python Course) https://lnkd.in/grkbfz_N
Subscribe to my Channel for more useful content.