Introduction to DBT (Data Build Tool)

dbt is an open-source command-line tool that enables data engineers and analysts to transform data in their warehouse more effectively. It allows users to define transformations as SQL select statements and manage them as version-controlled models.

dbt is the T in ELT. It doesn't extract or load data, but it's extremely good at transforming data that's already loaded into the warehouse.

dbt offers two products:

  • dbt Core is an open-source tool that enables data practitioners to transform data; it suits users who prefer to set up dbt manually and maintain it locally.
  • dbt Cloud offers the fastest, most reliable, and scalable way to deploy dbt. It allows data teams to optimize their data transformations by developing, testing, scheduling, and investigating data models through a single, fully managed, web-based service.

Differences between dbt Core and dbt Cloud

dbt Cloud Pricing Plan

Team Plan

Read-only seats are included with the purchase of Developer seats: even a purchase of a single Developer seat on the Team plan includes 5 free read-only seats. Up to 8 Developer seats can be purchased on the Team plan, but the free read-only seats stay at 5.

As an example: if we purchase 1 Developer seat on the Team plan, we receive 5 free read-only seats. If we purchase 6 Developer seats, we still have 5 free read-only seats. So, if we purchase only 6 Developer seats on the Team plan, the maximum cost would be $600/month.

Basic Terminologies in dbt

dbt Projects - A dbt project is a directory on the machine containing everything required to perform transformations on our data. It contains .sql files (called models) and YAML files (for configuration).
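
As a rough sketch (the project and folder names below are placeholders), a minimal dbt project might be laid out like this, with dbt_project.yml tying everything together:

my_dbt_project/
├── dbt_project.yml      # project name, profile, model paths, default materializations
├── models/
│   ├── staging/
│   │   └── stg_orders.sql
│   └── marts/
│       └── fct_orders.sql
├── snapshots/
├── seeds/
└── tests/

# dbt_project.yml (minimal example)
name: my_dbt_project
version: '1.0.0'
profile: my_dbt_project        # must match an entry in profiles.yml
model-paths: ["models"]
models:
  my_dbt_project:
    +materialized: view        # default; individual models can override this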

dbt Project Profiles - A project profile (profiles.yml) is a YAML file containing the connection details for our chosen data platform.
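
For illustration, a profiles.yml for the dbt-databricks adapter could look like the sketch below; the host, http_path, catalog, and schema values are placeholders, and the access token is read from an environment variable:

# profiles.yml (all connection values below are placeholders)
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: main                  # Unity Catalog (optional)
      schema: analytics              # target schema for models
      host: adb-1234567890123456.7.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4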

dbt Model - The atomic unit of transformation in dbt is called a model. Every dbt model defines all of its inputs using a special function called ref(). Models are primarily written as a select statement and saved as a .sql file. We can use CTEs (Common Table Expressions) and apply transformations in SQL.
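
As a sketch (the model and column names are made up for illustration), a model is just a select statement, with ref() wiring it to its upstream models:

-- models/marts/customer_orders.sql (hypothetical model)
with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
)

select
    c.customer_id,
    c.customer_name,
    count(o.order_id) as order_count
from customers c
left join orders o
    on o.customer_id = c.customer_id
group by c.customer_id, c.customer_name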

Source - Source tables refer to tables loaded into the warehouse by an Extract Load process.
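
A hedged sketch of how sources are typically declared in YAML and then referenced with the source() function (the schema and table names are placeholders):

# models/staging/sources.yml
version: 2

sources:
  - name: raw_sales          # arbitrary source name used in source() calls
    schema: raw              # schema where the Extract-Load process lands data
    tables:
      - name: orders
      - name: customers

-- selecting from the declared source inside a model
select * from {{ source('raw_sales', 'orders') }}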

Snapshots - A snapshot model in dbt is a special kind of model that tracks changes in the data over time.
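
For example, a timestamp-strategy snapshot might look like the sketch below (the source, unique key, and updated_at column are assumptions); dbt adds a new row whenever updated_at changes for a given order_id:

-- snapshots/orders_snapshot.sql (hypothetical snapshot)
{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('raw_sales', 'orders') }}

{% endsnapshot %}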

Staging - The staging area is where raw data is cast into correct data types, given consistent column names, and prepared to be transformed into models used by end-users.
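
A minimal staging model sketch, assuming the raw_sales source above; it only renames columns and casts them to the correct data types:

-- models/staging/stg_orders.sql (hypothetical staging model)
select
    cast(id as bigint)             as order_id,
    cast(cust_id as bigint)        as customer_id,
    cast(order_ts as timestamp)    as ordered_at,
    cast(amount as decimal(18, 2)) as order_amount
from {{ source('raw_sales', 'orders') }}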

Marts - Marts consist of the core tables for end-users and business vertical-specific tables.

Core - The core defines the fact and dimension models to be used by end-users.

How are data models created?

Every dbt model defines all of its inputs using a special Jinja function called ref().

dbt can select different parts of the DAG to execute.

Materializations define how to turn a model into a warehouse object.

dbt ships with several built-in materializations:

  • view: CREATE VIEW AS
  • table: CREATE TABLE AS
  • incremental: DELETE outdated records, INSERT new records

{{ config(materialized='table') }}
select ... from table_name

is rendered to:

CREATE TABLE ... AS (select ... from table_name);

# run all the models 
$ dbt run 

# run one specific model (my_model is a placeholder for the model's name)
$ dbt run --models my_model

# run a model and everything downstream that depends on it
$ dbt run --models my_model+

Testing Framework in dbt

dbt provides a powerful framework for testing data models. It allows us to define tests on our models to ensure data quality and integrity.

Inbuilt tests

Out of the box we can test uniqueness, not-null, accepted values, and foreign-key (relationships) constraints; a relationships example follows the block below.

version: 2

models:
  - name: employee
    description: This table contains the employee data loaded 
    columns:
      - name: id
        description: The unique key of employee table
        tests:
          - not_null
          - unique
      - name: emp_name
        description: The Name of an employee        
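
To cover the foreign-key case, the built-in relationships test can be added in the same file; a hedged sketch, where the department_id column and the department model are hypothetical:

      - name: department_id
        description: Foreign key to the department table
        tests:
          - relationships:
              to: ref('department')
              field: id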

Custom Tests

We can even define our custom logic to test the models.

-- marks_between.sql

{% test marks_between(model, column_name) %}

with validation as (
    select
        {{ column_name }} as marks
    from {{ model }}
),

validation_errors as (
    select
        marks
    from validation
    where marks < 0 OR marks > 100
)

select *
from validation_errors

{% endtest %}        
# student.yml

version: 2

models:
  - name: student
    description: This table contains the student data loaded
    columns:
      - name: student_id
        description: The unique key of student data
        tests:
          - not_null
      - name: marks
        description: The value of the marks
        tests:
          - marks_between

Debugging and Version Control

dbt provides built-in logging functionality that captures various events and execution statuses during its operation.
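
For example, the CLI can verify the setup and surface detailed logs (dbt writes them to logs/dbt.log inside the project directory):

# check the installation, profiles.yml, dbt_project.yml and the warehouse connection
$ dbt debug

# run with verbose logging printed to the console
$ dbt --debug run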

It also offers git-based version control.

Auto Documentation and Lineage

dbt generates documentation of the data models, provided the relevant details are supplied in the YAML files.

To generate and serve the documentation, run the following commands:

$ dbt docs generate
$ dbt docs serve        

Data lineage in dbt is represented through the DAG, which shows how data is transformed from raw sources to the final models used for analysis.

This lineage graph helps teams understand the dependencies and relationships between different data models, making it easier to identify potential issues and inefficiencies.
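
The same DAG can also be explored from the command line using dbt's node selection syntax (my_model is a placeholder):

# list a model together with everything upstream of it
$ dbt ls --select +my_model

# list a model together with everything downstream of it
$ dbt ls --select my_model+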

Incremental Loading in dbt

dbt's incremental loading mechanism relies on comparing source and target data to determine which records need to be updated or inserted. For large datasets or complex transformations, this comparison can be resource-intensive and may impact performance, especially for tables with millions of records.

While dbt provides a simple mechanism for implementing basic incremental loading based on date or timestamp columns, it may lack support for more complex incremental logic, such as delta or CDC (Change Data Capture) mechanisms.

{{
    config(
        materialized='incremental'
    )
}}

select
    *,
    my_slow_function(my_column)
from {{ ref('app_data_events') }}

{% if is_incremental() %}
  where event_time >= (select coalesce(max(event_time), '1900-01-01') from {{ this }})
{% endif %}        
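
Part of that gap can be closed by combining the incremental materialization with a unique_key and an incremental_strategy such as merge (supported by adapters like dbt-databricks), so changed rows are upserted instead of only appended. A minimal sketch, assuming event_id and payload columns that are not part of the original example:

{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='event_id'
    )
}}

select
    event_id,     -- assumed primary key column
    event_time,
    payload       -- assumed payload column
from {{ ref('app_data_events') }}

{% if is_incremental() %}
  -- only scan events newer than what already exists in the target table
  where event_time >= (select coalesce(max(event_time), '1900-01-01') from {{ this }})
{% endif %}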

Benefits of Using Databricks with dbt

  • Scalability: Azure Databricks efficiently processes large volumes of data, while dbt manages transformations at scale, ensuring that data models can grow with the business needs.
  • Collaboration and Version Control: dbt encourages collaboration among data team members by integrating with Git for version control. This ensures that changes to data models are tracked and can be reviewed.
  • Testing and Debugging: dbt provides a powerful framework for testing data models. It allows us to define tests on our models to ensure data quality and integrity. Debugging is also more straightforward with dbt's logging and documentation features.
  • Lineage and Dependencies: dbt automatically generates documentation of the data models, including dependency graphs. This makes it easier to understand the flow of data through the various layers of the Medallion Architecture and how tables depend on one another.
  • Efficiency in Building Data Models: With dbt, we can build data models directly on top of the data stored in Databricks. dbt compiles SQL code into Spark SQL, running transformations within Azure Databricks, thus leveraging Databricks' performance and scalability.

Advantages and Disadvantages of using dbt

Advantages

  • dbt supports a modular and incremental approach to data modeling, allowing teams to iteratively develop and refine data pipelines.
  • dbt's declarative testing syntax promotes readability and maintainability of test cases, aiding collaboration and knowledge sharing among team members.
  • dbt provides built-in logging functionality that captures various events and execution statuses during its operation.
  • dbt's support for Git-based version control enables seamless collaboration and coordination among distributed development teams.
  • Data lineage provides a visual graph (often represented as a Directed Acyclic Graph (DAG)) that shows how data flows from raw sources to the final models used for analysis. By tracing back through the lineage, data teams can pinpoint where issues lie and understand the upstream elements impacting their work.

Disadvantages

  • Implementing complex data transformations in SQL may lead to complex code, increasing maintenance overhead and readability challenges.
  • Overhead associated with setting up and maintaining test fixtures and data environments may delay the agility of development cycles, especially in complex data ecosystems.
  • Limited visibility into internal query execution plans and performance optimizations may hinder the diagnosis of performance bottlenecks and query inefficiencies.
  • Managing large repositories with extensive model hierarchies and interdependencies can lead to performance degradation and increased complexity in branching and merging operations.
  • Balancing the granularity of lineage tracking with performance considerations and resource constraints can be challenging, particularly in distributed or federated data architectures with diverse data processing workloads.

Using dbt for Stream Processing

dbt on its own cannot do stream processing; it needs to be paired with adapters or external streaming tools.

dbt is primarily designed for batch-oriented data processing and modeling rather than stream processing. While dbt is not specifically tailored for stream processing, it can still be used in conjunction with stream processing frameworks or tools to handle certain aspects of data transformation, modeling, and analytics.

  • dbt's batch-oriented processing model may introduce performance overhead when handling high-velocity data streams. Optimizations and performance tuning may be necessary to achieve the desired throughput and latency for stream processing use cases.
  • Stream processing workflows often involve complex event-driven architectures, message queuing systems, and stateful processing requirements. Integrating dbt into such workflows may introduce complexity and performance challenges, especially for high-volume streaming data sources.
  • Leveraging dbt for stream processing requires integration with external stream processing frameworks or tools (e.g., Apache Kafka, Apache Flink). Ensuring seamless interoperability and data consistency between dbt and stream processing components is crucial for successful deployment.
  • Defining comprehensive tests for stream processing workflows in dbt may require specialized expertise and consideration of streaming-specific challenges, such as event ordering, windowing, and event time processing.
  • Stream processing typically requires a continuous allocation of resources to handle incoming data streams in real-time. Integrating dbt into stream processing workflows may increase resource consumption, requiring careful management of compute and storage resources to avoid scalability and cost issues.

Steps for using dbt with Azure Databricks

  • Create a Resource Group in Azure
  • Set up the storage account and storage container
  • Set up Azure Key Vault for secrets
  • Set up Databricks
  • Verify the Databricks - Key Vault - secret scope integration
  • Set up dbt (a minimal install-and-verify sketch follows this list)
  • Configure dbt with Azure Databricks
  • Create dbt data models with Azure Databricks and ADLS Gen2
  • Generate dbt documentation
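
As a rough sketch of the dbt setup and configuration steps (assuming the dbt-databricks adapter; the project name is a placeholder):

# install dbt Core together with the Databricks adapter
$ pip install dbt-databricks

# scaffold a new project and answer the connection prompts
$ dbt init my_dbt_project

# confirm that dbt can reach Azure Databricks with the generated profile
$ dbt debug

# build the models
$ dbt run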

References

https://docs.getdbt.com/

dbt Fundamentals | dbt Learn (getdbt.com)

Using DBT for building a Medallion Lakehouse architecture (Azure Databricks + Delta + DBT) | by Piethein Strengholt | Medium

https://www.startdataengineering.com/post/dbt-data-build-tool-tutorial/

https://medium.com/@suffyan.asad1/getting-started-with-dbt-data-build-tool-a-beginners-guide-to-building-data-transformations-28e335be5f7e

