Using Models and Tests with dbt and Databricks: Ensuring Data Quality and Accuracy

Now that you’ve mastered the basics of dbt and Databricks integration (Article 1), let’s take it a step further and explore how to use models and tests to ensure data quality and accuracy in your data transformation workflows. By leveraging dbt’s modeling capabilities and testing framework, you can confidently transform your data with precision. Let’s dive in!

Defining Models

1. Create a Model:

In your dbt project’s `models` directory, you can build models on top of other models to create complex transformations. Simply reference the existing models in your new model definition. For example:


In this example, we create a model called `customers` that references `stg_customers` and `stg_orders`. This allows us to perform any necessary data cleaning or transformations specific to the customer data.

To do this:

  1. Create a new SQL file, models/stg_customers.sql, with the SQL from the customers CTE in our original query.
  2. Create a second new SQL file, models/stg_orders.sql, with the SQL from the orders CTE in our original query.


models/stg_customers.sql:

select
    id as customer_id,
    first_name,
    last_name

from dbt_achafik.jaffle_shop_customers        

models/stg_orders.sql:

select
    id as order_id,
    user_id as customer_id,
    order_date,
    status

from dbt_achafik.jaffle_shop_orders        

2. Build Models on Top: To build the `customers` model on top of the staging models, edit the SQL in your models/customers.sql file as follows, referencing the existing models:

with customers as (

    select * from {{ ref('stg_customers') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

),

customer_orders as (

    select
        customer_id,

        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders

    from orders

    group by 1

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders

    from customers

    left join customer_orders using (customer_id)

)

select * from final        

Here, we’re building the `customers` model on top of the `stg_customers` and `stg_orders` models. By reusing and extending existing models, you can create a layered and modular approach to your data transformations.

Execute:

dbt run        

This time, when you run dbt run, separate views/tables are created for stg_customers, stg_orders, and customers. dbt infers the order in which to run these models: because customers depends on stg_customers and stg_orders, dbt builds customers last. You do not need to explicitly define these dependencies.
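Under the hood, `{{ ref('...') }}` is what makes this possible: at compile time, dbt resolves each ref to the concrete relation in your target, and the refs themselves define the dependency graph. The compiled output (written under target/compiled/) looks roughly like the excerpt below — illustrative only, and assuming the same `dbt_achafik` target schema used in the staging models above:

-- Compiled excerpt of models/customers.sql (illustrative)
with customers as (

    select * from dbt_achafik.stg_customers   -- was {{ ref('stg_customers') }}

),

orders as (

    select * from dbt_achafik.stg_orders      -- was {{ ref('stg_orders') }}

),
...        

Because the refs are resolved per target, the same model code runs unchanged against a development schema or production catalog.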


Writing Tests

1. Define Tests:

Writing tests for models built on top of other models follows a similar approach. Create separate test definitions and reference the corresponding models. For example:

Create a new YAML file in the models directory, named models/schema.yml:

version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id        
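Beyond the generic tests above (unique, not_null, accepted_values, relationships), dbt also supports singular tests: a SQL file in your project’s tests/ directory whose query should return zero rows — any rows it returns are reported as failures when you run dbt test. A minimal sketch (the file name and the rule it checks are hypothetical, written against the staging models above):

-- tests/assert_no_orders_without_customers.sql
-- Fails if any order references a customer_id missing from stg_customers.
select
    o.order_id,
    o.customer_id

from {{ ref('stg_orders') }} o
left join {{ ref('stg_customers') }} c using (customer_id)
where c.customer_id is null        

Singular tests are handy for one-off business rules that don’t fit a reusable generic test.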

2. Run Tests:

Execute the tests with:

dbt test        

and confirm that all your tests pass.

dbt will run the defined tests against the transformed data in Databricks, ensuring the integrity and accuracy of your data transformations.
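If you want a failing check to warn rather than fail the run, dbt lets you set a severity on any test. A sketch, reusing the stg_orders definition from the schema.yml above:

models:
  - name: stg_orders
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
              config:
                severity: warn  # report failures without failing the run        

With severity set to warn, dbt test still surfaces the failing rows in its output, but the command exits successfully — useful while iterating on new data sources.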


Data Quality Assurance Achieved!

By building models on top of other models in your dbt project, you can create complex and interconnected data transformations while maintaining a modular and reusable structure. The tests validate the column structure, aggregations, and any other criteria you define, ensuring the quality and accuracy of your data.

Next Steps: Experiment with additional models, explore different testing scenarios, and fine-tune your data transformation workflows. Stay tuned for more articles where we’ll delve deeper into advanced techniques and best practices for using dbt and Databricks effectively!

#dbt #Databricks #DataTransformation #DataAnalytics #DataEngineering #DataQuality #Testing
