Using Models and Tests with dbt and Databricks: Ensuring Data Quality and Accuracy! ?
Now that you’ve mastered the basics of dbt and Databricks integration (Article 1), let’s take it a step further and explore how to use models and tests to ensure data quality and accuracy in your data transformation workflows. By leveraging dbt’s modeling capabilities and testing framework, you can confidently transform your data with precision. Let’s dive in!
?? Defining?Models
1?? Create a Model:
In your dbt project’s `models` directory, you can build models on top of other models to create complex transformations. Simply reference the existing models in your new model definition. For example:
In this example, we create a model called `customers` that references stg_customers and stg_orders. This allows us to perform any necessary data cleaning or transformations specific to the customer data.
To do this
select
id as customer_id,
first_name,
last_name
from dbt_achafik.jaffle_shop_customers
select
id as order_id,
user_id as customer_id,
order_date,
status
from dbt_achafik.jaffle_shop_orders
2?? Build Models on Top: To build a model on top of the `customer` model, Edit the SQL in your models/customers.sql file as follows and reference the existing model. For instance:
with customers as (
select * from {{ ref('stg_customers') }}
),
orders as (
select * from {{ ref('stg_orders') }}
),
customer_orders as (
select
customer_id,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date,
count(order_id) as number_of_orders
from orders
group by 1
),
final as (
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order_date,
customer_orders.most_recent_order_date,
coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from customers
left join customer_orders using (customer_id)
)
select * from final
Here, we’re building the `customer` model on top of the `stg_orders` and stg_customers model. By reusing and extending existing models, you can create a layered and modular approach to your data transformations.
Execute
dbt run
This time, when you performed a dbt run, separate views/tables were created for stg_customers, stg_orders and customers. dbt inferred the order to run these models. Because customers depends on stg_customers and stg_orders, dbt builds customers last. You do not need to explicitly define these dependencies.
领英推荐
?? Writing?Tests
1?? Define Tests:
Writing tests for models built on top of other models follow a similar approach. Create separate test files and reference the corresponding models. For example:
Create a new YAML file in the models directory, named models/schema.yml
version: 2
models:
- name: customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: stg_customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: stg_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: status
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
- name: customer_id
tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id
2?? Run Tests:
Execute the tests using the
dbt test
and confirm that all your tests passed.
dbt will run the defined tests against the transformed data in Databricks, ensuring the integrity and accuracy of your data transformations.
? Data Quality Assurance Achieved!
By building models on top of other models in your dbt project, you can create complex and interconnected data transformations while maintaining a modular and reusable structure. The tests validate the column structure, aggregations, and any other criteria you define, ensuring the quality and accuracy of your data.
?? Next Steps: Experiment with additional models, explore different testing scenarios, and fine-tune your data transformation workflows. Stay tuned for more articles where we’ll delve deeper into advanced techniques and best practices for using dbt and Databricks effectively!