Using Models and Tests with dbt and Databricks: Ensuring Data Quality and Accuracy

Now that you’ve mastered the basics of dbt and Databricks integration (Article 1), let’s take it a step further and explore how to use models and tests to ensure data quality and accuracy in your data transformation workflows. By leveraging dbt’s modeling capabilities and testing framework, you can confidently transform your data with precision. Let’s dive in!

Defining Models

1. Create a Model:

In your dbt project’s `models` directory, you can build models on top of other models to create complex transformations. Simply reference the existing models in your new model definition. For example:


In this example, we create a model called `customers` that references `stg_customers` and `stg_orders`. This allows us to perform any necessary data cleaning or transformations specific to the customer data.

To do this:

  1. Create a new SQL file, models/stg_customers.sql, with the SQL from the customers CTE in our original query.
  2. Create a second new SQL file, models/stg_orders.sql, with the SQL from the orders CTE in our original query.


models/stg_customers.sql:

select
    id as customer_id,
    first_name,
    last_name

from dbt_achafik.jaffle_shop_customers        

models/stg_orders.sql:

select
    id as order_id,
    user_id as customer_id,
    order_date,
    status

from dbt_achafik.jaffle_shop_orders        

2. Build Models on Top: To build the `customers` model on top of the staging models, edit the SQL in your models/customers.sql file as follows, referencing the existing models:

with customers as (

    select * from {{ ref('stg_customers') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

),

customer_orders as (

    select
        customer_id,

        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders

    from orders

    group by 1

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders

    from customers

    left join customer_orders using (customer_id)

)

select * from final        

Here, we’re building the `customers` model on top of the `stg_customers` and `stg_orders` models. By reusing and extending existing models, you can create a layered and modular approach to your data transformations.

Execute:

dbt run        

This time, when you run dbt run, separate views/tables are created for stg_customers, stg_orders, and customers. dbt infers the order in which to run these models: because customers depends on stg_customers and stg_orders, dbt builds customers last. You do not need to explicitly define these dependencies.
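Under the hood, `{{ ref('...') }}` is what makes this possible: at compile time, dbt resolves each ref to the concrete relation in your target, and the refs themselves define the dependency graph. The compiled output (written under target/compiled/) looks roughly like the excerpt below — illustrative only, and assuming the same `dbt_achafik` target schema used in the staging models above:

-- Compiled excerpt of models/customers.sql (illustrative)
with customers as (

    select * from dbt_achafik.stg_customers   -- was {{ ref('stg_customers') }}

),

orders as (

    select * from dbt_achafik.stg_orders      -- was {{ ref('stg_orders') }}

),
...        

Because the refs are resolved per target, the same model code runs unchanged against a development schema or production catalog.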


Writing Tests

1. Define Tests:

Writing tests for models built on top of other models follows a similar approach. Create separate test definitions and reference the corresponding models. For example:

Create a new YAML file in the models directory, named models/schema.yml:

version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id        
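Beyond the generic tests above (unique, not_null, accepted_values, relationships), dbt also supports singular tests: a SQL file in your project’s tests/ directory whose query should return zero rows — any rows it returns are reported as failures when you run dbt test. A minimal sketch (the file name and the rule it checks are hypothetical, written against the staging models above):

-- tests/assert_no_orders_without_customers.sql
-- Fails if any order references a customer_id missing from stg_customers.
select
    o.order_id,
    o.customer_id

from {{ ref('stg_orders') }} o
left join {{ ref('stg_customers') }} c using (customer_id)
where c.customer_id is null        

Singular tests are handy for one-off business rules that don’t fit a reusable generic test.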

2. Run Tests:

Execute the tests with:

dbt test        

and confirm that all your tests pass.

dbt will run the defined tests against the transformed data in Databricks, ensuring the integrity and accuracy of your data transformations.
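If you want a failing check to warn rather than fail the run, dbt lets you set a severity on any test. A sketch, reusing the stg_orders definition from the schema.yml above:

models:
  - name: stg_orders
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
              config:
                severity: warn  # report failures without failing the run        

With severity set to warn, dbt test still surfaces the failing rows in its output, but the command exits successfully — useful while iterating on new data sources.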


Data Quality Assurance Achieved!

By building models on top of other models in your dbt project, you can create complex and interconnected data transformations while maintaining a modular and reusable structure. The tests validate the column structure, aggregations, and any other criteria you define, ensuring the quality and accuracy of your data.

Next Steps: Experiment with additional models, explore different testing scenarios, and fine-tune your data transformation workflows. Stay tuned for more articles where we’ll delve deeper into advanced techniques and best practices for using dbt and Databricks effectively!

#dbt #Databricks #DataTransformation #DataAnalytics #DataEngineering #DataQuality #Testing
