DBT ZERO TO HERO

In this article we focus on dbt: its functionalities, its integrations, and its components for building and scaling data engineering pipelines, most importantly ELT pipelines. dbt, commonly referred to as the Data Build Tool, is a transformation tool used in data pipelines. These pipelines involve processes such as data modelling, data quality tests, data cleaning, data aggregation, data filtering, and data merging.


These transformations are done once the data is already in a data warehouse such as Snowflake or Amazon Redshift (dbt also works with Databricks on Azure). After the dimension tables and fact tables have been created, they are loaded into data visualization tools such as Power BI, Tableau, or Superset.

Some of the components of dbt that you will find:

  • SNAPSHOT: This is a table-like construct created in dbt to track changes in data. We use a Type 2 Slowly Changing Dimension (SCD2) when we want to preserve old data. It tracks changes, including when data was altered or deleted, which can be useful for detecting fraud or data manipulation. Additionally, it keeps the schema of the data consistent for new entries, simplifying data cleaning. We run snapshots using the command dbt snapshot (a snapshot sketch follows this list).
  • SOURCES: This is an abstraction layer on top of input tables in the data warehouse. Essentially, it defines where the table data will be accessed from in the data warehouse, guiding the model on where to find the data it needs.
  • SEEDS: These are CSV files kept inside the dbt project (edited, for example, in VS Code) that dbt loads into the warehouse, such as Snowflake. Once loaded, they can be referenced like any other table. We run seeds using the command dbt seed.
  • TESTS: There are generic and singular tests. Generic tests include unique values, not null, accepted values, and relationships. Singular tests can be created and linked to macros. An example is a test that iterates over all columns in a table to ensure there are no null values.
  • SCHEMA.YML: This file is crucial for the project. It contains additional information such as column descriptions (visible in the documentation) and the placement of tests on models and columns (a combined sources-and-tests sketch follows this list).
  • MACROS: These are Jinja templates used inside SQL and created in the macros folder. A macro can back a singular test by being called from a file in the tests folder where the test is defined (see the macro-and-test sketch after this list).
  • PACKAGES.YML: This is where we declare the packages used in the project, such as dbt_utils and dbt_expectations (the Great Expectations-style package for dbt); they are installed with dbt deps (example after this list).
  • ASSETS: This folder is not created by default; it needs to be added and registered under asset-paths in dbt_project.yml. It can hold images that appear in the generated documentation served on localhost, such as a diagram of the models, to accurately represent the project.
  • DBT_PROJECT.YML: This file defines the entire project, including configurations, paths, and default materializations (e.g., table, view, or ephemeral). Materializations depend on how the data will be consumed: dimension and fact tables used in visualizations are usually materialized as tables, intermediate tables not used in visualizations can be ephemeral, and incremental materializations suit tables that keep receiving new data (a configuration sketch follows this list).
  • HOOKS AND EXPOSURES: Hooks are SQL statements run before or after a model builds (pre- and post-hooks), for example to grant permissions on the resulting tables. Exposures declare downstream uses of the project, such as a dashboard in a BI tool, so the visualization becomes part of the project's lineage and documentation (an exposure sketch follows this list).
  • MATERIALIZATIONS: Materializing models determines how they are persisted and viewed in the data warehouse; the main types are view, table, incremental, and ephemeral.
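
To make the SNAPSHOT item concrete, here is a minimal sketch of a timestamp-strategy snapshot. The source, table, and column names (raw, orders, order_id, updated_at) are hypothetical and would need to match your own warehouse.

    -- snapshots/orders_snapshot.sql (hypothetical names)
    {% snapshot orders_snapshot %}

    {{
        config(
          target_schema='snapshots',
          unique_key='order_id',
          strategy='timestamp',
          updated_at='updated_at'
        )
    }}

    -- every change to a row in the source produces a new versioned record (SCD2)
    select * from {{ source('raw', 'orders') }}

    {% endsnapshot %}

Running dbt snapshot then builds and maintains the snapshot table in the warehouse.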
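
The SOURCES, TESTS, and SCHEMA.YML items can be illustrated together with a sketch of a schema.yml file; the database, schema, table, model, and column names (raw_db, raw, orders, stg_orders, order_id, status) are hypothetical.

    # models/schema.yml (hypothetical names)
    version: 2

    sources:
      - name: raw                  # models refer to this as source('raw', 'orders')
        database: raw_db
        schema: raw
        tables:
          - name: orders

    models:
      - name: stg_orders
        description: "Staging model built on top of the raw orders source"
        columns:
          - name: order_id
            description: "Primary key of the order"
            tests:                 # generic tests
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'returned']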
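
For the MACROS item, below is a sketch of a macro that iterates over all columns of a model, plus the singular test in the tests folder that calls it. The model name stg_orders is hypothetical.

    -- macros/no_nulls_in_columns.sql
    {% macro no_nulls_in_columns(model) %}
        -- returns rows where any column of the model is null; the test passes when no rows come back
        select * from {{ model }}
        where
        {% for col in adapter.get_columns_in_relation(model) -%}
            {{ col.column }} is null {% if not loop.last %} or {% endif %}
        {%- endfor %}
    {% endmacro %}

    -- tests/test_no_nulls_stg_orders.sql (a singular test that calls the macro)
    {{ no_nulls_in_columns(ref('stg_orders')) }}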
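
A sketch of the packages.yml mentioned above; the version ranges shown are placeholders, and the Great Expectations-style package is published on the dbt hub as dbt_expectations.

    # packages.yml (version ranges are placeholders)
    packages:
      - package: dbt-labs/dbt_utils
        version: [">=1.0.0", "<2.0.0"]
      - package: calogica/dbt_expectations
        version: [">=0.10.0", "<0.11.0"]

After editing this file, run dbt deps to download and install the packages.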
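
The DBT_PROJECT.YML, MATERIALIZATIONS, ASSETS, and HOOKS items come together in a configuration sketch like the one below; the project and folder names (my_dbt_project, staging, marts) and the grant statement are hypothetical, and the exact grant syntax depends on the warehouse.

    # dbt_project.yml (excerpt, hypothetical names)
    name: 'my_dbt_project'
    version: '1.0.0'
    profile: 'my_dbt_project'

    model-paths: ["models"]
    seed-paths: ["seeds"]
    snapshot-paths: ["snapshots"]
    macro-paths: ["macros"]
    test-paths: ["tests"]
    asset-paths: ["assets"]          # registers the assets folder used for documentation images

    models:
      my_dbt_project:
        staging:
          +materialized: ephemeral   # intermediate models not exposed to the BI tool
        marts:
          +materialized: table       # dimension and fact tables used in visualizations
          +post-hook: "grant select on {{ this }} to role reporter"   # example post-hook; role name is hypothetical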
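
And for the EXPOSURES item, a minimal exposure declaration might look like the sketch below; the exposure name, dashboard URL, owner details, and the fct_orders model are all placeholders.

    # models/exposures.yml (hypothetical names and URL)
    version: 2

    exposures:
      - name: sales_dashboard
        type: dashboard
        url: https://app.powerbi.com/yourDashboardUrl   # placeholder URL
        description: "Sales dashboard built on the fact and dimension tables"
        depends_on:
          - ref('fct_orders')
        owner:
          name: Analytics team
          email: analytics@example.com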

All of these live as folders inside the dbt project once you initialize it, and they are what gets materialized when the project runs. Instructions for initializing the project on the different operating systems are available in the GitHub repository linked at the end.
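
As a minimal sketch of the commands involved (assuming the dbt CLI and a warehouse adapter are already installed; my_dbt_project is a placeholder name):

    dbt init my_dbt_project    # scaffolds the project folders described above
    cd my_dbt_project
    dbt debug                  # verifies the connection to the warehouse profile
    dbt seed                   # loads the CSV seeds
    dbt run                    # builds the models with their configured materializations
    dbt test                   # runs the generic and singular tests
    dbt snapshot               # captures the SCD2 snapshots
    dbt docs generate          # builds the documentation
    dbt docs serve             # serves the documentation on localhost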


Orchestrating the dbt project can be done with several tools, discussed below.

ORCHESTRATION

In data engineering projects these processes can be automated, and various factors influence how a particular project is orchestrated. You can find more information on this in the following article, which provides clear insights.

In this project we will orchestrate our workflow with Dagster. There are other orchestration tools available, so I will also explain why we chose this one by looking at some of the tools that offer a dbt integration.

Apache Airflow

  • Installation can be challenging; to run it properly on a local machine it typically needs a Docker container, which can be tricky to set up.
  • If run in the cloud it becomes very expensive. Keeping a hosted Airflow instance running (for example on a managed service such as Amazon MWAA or in a cloud container) incurs charges that can become prohibitive, especially when it stays up to handle data synchronizations at various time frames.
  • Airflow does not integrate seamlessly with dbt out of the box; it primarily focuses on running jobs, which makes it difficult to debug dbt projects and identify errors. More recently, the Cosmos package has addressed this challenge by integrating dbt projects into Airflow.
  • On the positive side, the tool is open source and free to adopt.

Prefect

  • It has a simple integration with dbt.
  • It is also open source.

Azure Data Factory

  • It does not have a particularly good dbt integration.
  • It is not open source.

dbt Cloud

  • It has a very tight, well-built integration, and you can run whichever dbt commands you want.
  • It is not open source; it is a paid product.

Dagster

  • It has a great, very user-friendly UI.
  • It is very easy to debug.
  • It is open source.
  • No prior coding of jobs and tasks is required, so it is up and running with just a few lines of code.

The whole implementation of the project is available in the following GitHub repository, which you can fork to follow along: https://github.com/stilinsk/dbt_zero_to_hero


#letslearndata
