DBT ZERO TO HERO

In this article we focus on dbt: its functionalities, its integrations, and its components for building and scaling data engineering pipelines, most importantly ELT pipelines. dbt, commonly referred to as the Data Build Tool, is a transformation tool used in data pipelines. These pipelines involve processes such as data modelling, data quality tests, data cleaning, data aggregation, data filtering, and data merging.


These transformations are done once the data is already in a data warehouse such as Snowflake or Amazon Redshift (dbt also works with Databricks on Azure). After the dimension tables and fact tables have been created, they are loaded into data visualization tools such as Power BI, Tableau, or Superset.

Some of the components of dbt that you will find:

  • SNAPSHOT: This is a table-like construct created in dbt to track changes in data. We use a Type 2 Slowly Changing Dimension (SCD2) when we want to preserve old data. It tracks changes, including when data was altered or deleted, which can be useful for detecting fraud or data manipulation. Additionally, it keeps the schema of the data consistent for new entries, simplifying data cleaning. We run snapshots using the command dbt snapshot (a snapshot sketch follows this list).
  • SOURCES: This is an abstraction layer on top of input tables in the data warehouse. Essentially, it defines where the table data will be accessed from in the data warehouse, guiding the model on where to find the data it needs.
  • SEEDS: These are CSV files kept inside the dbt project (edited, for example, in VS Code) that dbt loads into the warehouse, such as Snowflake. Once loaded, they can be referenced like any other table. We run seeds using the command dbt seed.
  • TESTS: There are generic and singular tests. Generic tests include unique values, not null, accepted values, and relationships. Singular tests can be created and linked to macros. An example is a test that iterates over all columns in a table to ensure there are no null values.
  • SCHEMA.YML: This file is crucial for the project. It contains additional information such as column descriptions (visible in the documentation) and the placement of tests on models and columns (a combined sources-and-tests sketch follows this list).
  • MACROS: These are Jinja templates used inside SQL and created in the macros folder. A macro can back a singular test by being called from a file in the tests folder where the test is defined (see the macro-and-test sketch after this list).
  • PACKAGES.YML: This is where we declare the packages used in the project, such as dbt_utils and dbt_expectations (the Great Expectations-style package for dbt); they are installed with dbt deps (example after this list).
  • ASSETS: This folder is not created by default; it needs to be added and registered under asset-paths in dbt_project.yml. It can hold images that appear in the generated documentation served on localhost, such as a diagram of the models, to accurately represent the project.
  • DBT_PROJECT.YML: This file defines the entire project, including configurations, paths, and default materializations (e.g., table, view, or ephemeral). Materializations depend on how the data will be consumed: dimension and fact tables used in visualizations are usually materialized as tables, intermediate tables not used in visualizations can be ephemeral, and incremental materializations suit tables that keep receiving new data (a configuration sketch follows this list).
  • HOOKS AND EXPOSURES: Hooks are SQL statements run before or after a model builds (pre- and post-hooks), for example to grant permissions on the resulting tables. Exposures declare downstream uses of the project, such as a dashboard in a BI tool, so the visualization becomes part of the project's lineage and documentation (an exposure sketch follows this list).
  • MATERIALIZATIONS: Materializing models determines how they are persisted and viewed in the data warehouse; the main types are view, table, incremental, and ephemeral.
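
To make the SNAPSHOT item concrete, here is a minimal sketch of a timestamp-strategy snapshot. The source, table, and column names (raw, orders, order_id, updated_at) are hypothetical and would need to match your own warehouse.

    -- snapshots/orders_snapshot.sql (hypothetical names)
    {% snapshot orders_snapshot %}

    {{
        config(
          target_schema='snapshots',
          unique_key='order_id',
          strategy='timestamp',
          updated_at='updated_at'
        )
    }}

    -- every change to a row in the source produces a new versioned record (SCD2)
    select * from {{ source('raw', 'orders') }}

    {% endsnapshot %}

Running dbt snapshot then builds and maintains the snapshot table in the warehouse.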
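
The SOURCES, TESTS, and SCHEMA.YML items can be illustrated together with a sketch of a schema.yml file; the database, schema, table, model, and column names (raw_db, raw, orders, stg_orders, order_id, status) are hypothetical.

    # models/schema.yml (hypothetical names)
    version: 2

    sources:
      - name: raw                  # models refer to this as source('raw', 'orders')
        database: raw_db
        schema: raw
        tables:
          - name: orders

    models:
      - name: stg_orders
        description: "Staging model built on top of the raw orders source"
        columns:
          - name: order_id
            description: "Primary key of the order"
            tests:                 # generic tests
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'returned']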
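
For the MACROS item, below is a sketch of a macro that iterates over all columns of a model, plus the singular test in the tests folder that calls it. The model name stg_orders is hypothetical.

    -- macros/no_nulls_in_columns.sql
    {% macro no_nulls_in_columns(model) %}
        -- returns rows where any column of the model is null; the test passes when no rows come back
        select * from {{ model }}
        where
        {% for col in adapter.get_columns_in_relation(model) -%}
            {{ col.column }} is null {% if not loop.last %} or {% endif %}
        {%- endfor %}
    {% endmacro %}

    -- tests/test_no_nulls_stg_orders.sql (a singular test that calls the macro)
    {{ no_nulls_in_columns(ref('stg_orders')) }}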
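
A sketch of the packages.yml mentioned above; the version ranges shown are placeholders, and the Great Expectations-style package is published on the dbt hub as dbt_expectations.

    # packages.yml (version ranges are placeholders)
    packages:
      - package: dbt-labs/dbt_utils
        version: [">=1.0.0", "<2.0.0"]
      - package: calogica/dbt_expectations
        version: [">=0.10.0", "<0.11.0"]

After editing this file, run dbt deps to download and install the packages.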
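
The DBT_PROJECT.YML, MATERIALIZATIONS, ASSETS, and HOOKS items come together in a configuration sketch like the one below; the project and folder names (my_dbt_project, staging, marts) and the grant statement are hypothetical, and the exact grant syntax depends on the warehouse.

    # dbt_project.yml (excerpt, hypothetical names)
    name: 'my_dbt_project'
    version: '1.0.0'
    profile: 'my_dbt_project'

    model-paths: ["models"]
    seed-paths: ["seeds"]
    snapshot-paths: ["snapshots"]
    macro-paths: ["macros"]
    test-paths: ["tests"]
    asset-paths: ["assets"]          # registers the assets folder used for documentation images

    models:
      my_dbt_project:
        staging:
          +materialized: ephemeral   # intermediate models not exposed to the BI tool
        marts:
          +materialized: table       # dimension and fact tables used in visualizations
          +post-hook: "grant select on {{ this }} to role reporter"   # example post-hook; role name is hypothetical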
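
And for the EXPOSURES item, a minimal exposure declaration might look like the sketch below; the exposure name, dashboard URL, owner details, and the fct_orders model are all placeholders.

    # models/exposures.yml (hypothetical names and URL)
    version: 2

    exposures:
      - name: sales_dashboard
        type: dashboard
        url: https://app.powerbi.com/yourDashboardUrl   # placeholder URL
        description: "Sales dashboard built on the fact and dimension tables"
        depends_on:
          - ref('fct_orders')
        owner:
          name: Analytics team
          email: analytics@example.com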

All of these live as folders inside the dbt project once you initialize it, and they are what gets materialized when the project runs. Instructions for initializing the project on the different operating systems are available in the GitHub repository linked at the end.
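
As a minimal sketch of the commands involved (assuming the dbt CLI and a warehouse adapter are already installed; my_dbt_project is a placeholder name):

    dbt init my_dbt_project    # scaffolds the project folders described above
    cd my_dbt_project
    dbt debug                  # verifies the connection to the warehouse profile
    dbt seed                   # loads the CSV seeds
    dbt run                    # builds the models with their configured materializations
    dbt test                   # runs the generic and singular tests
    dbt snapshot               # captures the SCD2 snapshots
    dbt docs generate          # builds the documentation
    dbt docs serve             # serves the documentation on localhost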


Orchestrating the dbt project can be done with several tools, discussed below.

ORCHESTRATION

In data engineering projects these processes can be automated, and various factors influence how a particular project is orchestrated. You can find more information on this in the following article, which provides clear insights.

In this project we will orchestrate our workflow with Dagster. There are other orchestration tools available, so I will also explain why we chose this one by looking at some of the tools that offer a dbt integration.

Apache Airflow

  • Installation can be challenging; to run it properly on a local machine it typically needs a Docker container, which can be tricky to set up.
  • If run in the cloud it becomes very expensive. Keeping a hosted Airflow instance running (for example on a managed service such as Amazon MWAA or in a cloud container) incurs charges that can become prohibitive, especially when it stays up to handle data synchronizations at various time frames.
  • Airflow does not integrate seamlessly with dbt out of the box; it primarily focuses on running jobs, which makes it difficult to debug dbt projects and identify errors. More recently, the Cosmos package has addressed this challenge by integrating dbt projects into Airflow.
  • On the positive side, the tool is open source and free to adopt.

Prefect

  • It has a simple integration with dbt.
  • It is also open source.

Azure Data Factory

  • It does not have a particularly good dbt integration.
  • It is not open source.

dbt Cloud

  • It has a very tight, well-built integration, and you can run whichever dbt commands you want.
  • It is not open source; it is a paid product.

Dagster

  • It has a great, very user-friendly UI.
  • It is very easy to debug.
  • It is open source.
  • No prior coding of jobs and tasks is required, so it is up and running with just a few lines of code.

The whole implementation of the project is available in the following GitHub repository, which you can fork to follow along: https://github.com/stilinsk/dbt_zero_to_hero


#letslearndata
