DBT: Capture The In-house Data Flow
I was lucky to witness the transition from Airflow-scheduled SQL operators to dbt at one of the companies I worked for, and I saw first-hand how dbt streamlined the data flow. This article describes my experience with dbt Core as a tool to build and monitor in-house data movements with SQL. It is organized into the sections below.
1. Historical movements
In this section, we first look at the role of Dbt in our Extract-Load-Transform (ELT) flow. Feel free to skip to Part 2 (DBT Useful Components) if you are only interested in the tech behind Dbt.
Let's assume we start as a lumberjack trying to build up our wood storage. Since the village is tiny at the start, we prefer preprocessing the trees outdoors before transporting them to our warehouse. This scenario is similar to an ETL pipeline.
Certainly, working outdoors can be dangerous, with limited resources. Once the village grows, we would rather bring the trees back home, store them, and process them later. This allows our carpenters to serve different furniture design needs without worrying about getting everything right during the first preprocessing step, and since the wood is cached, we can iterate fast and fail fast.
Now comes the scaling issue. If every house must have exactly one resident, it is troublesome when we want to store a huge amount of wood but process only a little, because many residents would sit idle. Vice versa, when we have a small amount of wood but want to process it fast by employing 1000 residents, we can't. This leads to the need to decouple houses and residents: let's have a separate storage house while keeping our group of residents flexible and on-demand.
That's it, our village becomes a saw-mill in no time, thanks to advancements in cheap storage and on-demand compute that let the two scale independently. This is the same shift that made the ELT pattern practical.
But, where does DBT stand in the whole pipeline?
The short answer is: Dbt manages and triggers SQL runs against your database, warehouse, lake, or query engine.
The longer answer is: dbt manages the SQL scripts in a well-organized fashion and beyond, including Medallion data models, materialization, data quality checks, and documentation. Its integration with Airflow/Databricks allows active development in modern data systems.
Let's have a look at the next section to understand more.
2. DBT Useful Components
If you're a carpenter constructing a house, dbt is the blueprint-managing assistant.
Let's explore each component by building up an Airbnb sample data flow.
Documentation (Interface)
The promise from dbt is that you will have a GUI to view the constructed data flow from the start, and it is progressively updated along with your build. It includes a view of the overall DAG and a page (similar to Airflow's UI, right!) that describes your tables and their dependencies.
This is set up under locations [models/docs.md, schema.yml] and compiled by these command lines:
dbt docs generate
dbt docs serve
You can add descriptions or docs for models under models/docs.md and reference them under models/schema.yml, as sketched below.
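A minimal sketch (the doc block and model names here are my own placeholders, not the exact project's):

Under models/docs.md:
{% docs dim_listings_doc %}
Cleaned Airbnb listings, one row per listing.
{% enddocs %}

Under models/schema.yml:
version: 2
models:
  - name: dim_listings
    description: '{{ doc("dim_listings_doc") }}'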
Model (Blueprints) - The Backbone
Suppose you want to build the Airbnb data flow below where tables/views are derived from one another.
The steps are
1. Import data sources in raw layers into your DB. Here I use Snowflake.
2. Set up Dbt Core (locally), initiate, and connect to Snowflake.
3. Under the models directory, add stages and internal SQL scripts to create each component.
This is where you define the blocks and connections of your flow with SQL scripts and Jinja placeholders. We add references to other tables inside the SQL scripts, which helps dbt construct the dependency graph we need; a minimal example is sketched below.
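A minimal sketch (the layer, model, and column names are assumptions based on the Airbnb example, and the source is assumed to be registered in sources.yml):

Under models/src/src_listings.sql:
SELECT
    id AS listing_id,
    name AS listing_name,
    minimum_nights,
    host_id
FROM {{ source('airbnb', 'listings') }}   -- raw table registered in models/sources.yml

Under models/dim/dim_listings.sql:
SELECT
    listing_id,
    listing_name,
    minimum_nights
FROM {{ ref('src_listings') }}            -- dbt resolves this reference and builds the DAG edge

The {{ ref() }} and {{ source() }} placeholders are what let dbt infer the dependency graph and run the models in the right order.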
To run an individual model, or to include its upstream or downstream models, use the following commands
dbt run --select my_model (only my_model)
dbt run --select +my_model (my_model plus its upstream dependencies)
dbt run --select my_model+ (my_model plus its downstream dependents)
To summarize, data models are defined under locations [models/layer_name/*.sql]. To compile, run the following command, which generates executable SQL from the model, test, and analysis files.
dbt compile
Slowly changing dimensions (SCD)
To reason about the models, we need to understand the SCD concept. It classifies how each dimension handles changes over time: Type 0 (never changes), Type 1 (overwrite the old value), Type 2 (add a new row and keep history with validity ranges), and Type 3 (keep the previous value in an extra column).
As these descriptions suggest, the concept is very useful when we construct a flow of data models.
Materialization
dbt supports different types of materialization for each SQL model you create, such as view, table, incremental, and ephemeral.
You can specify the materialization for a group of tables in the yml config file or at the beginning of each SQL file (which overrides the common config in the yml file).
To summarize, materialization is defined under locations [models/*.sql, dbt_project.yml].
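A minimal sketch of both places (the project name, folder names, and columns are my own assumptions):

Under dbt_project.yml:
models:
  airbnb_dbt:               # hypothetical project name
    src:
      +materialized: view   # every model in models/src/ becomes a view
    dim:
      +materialized: table

At the top of a single model, overriding the project-level config:
{{ config(materialized='incremental', unique_key='review_id') }}
SELECT * FROM {{ ref('src_reviews') }}
{% if is_incremental() %}
WHERE review_date > (SELECT MAX(review_date) FROM {{ this }})   -- only load new rows on incremental runs
{% endif %}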
Tests
This is one of the strong points of Dbt, which is inspired by Great Expectations (https://docs.greatexpectations.io/docs).
The general command line to run all the tests, or for selected tables, is
dbt test (--select table)
To write tests, there are two options
1. Generic tests: built-in functions that test properties of columns in tables.
Example: not_null, unique, accepted_values, relationships
2. Singular tests: locations [tests/*.sql]
Custom SQL tests that pass when the query returns no rows.
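A minimal sketch of both options (table and column names are assumptions from the Airbnb example):

1. Generic tests, inside models/schema.yml:
models:
  - name: dim_listings
    columns:
      - name: listing_id
        tests:
          - unique
          - not_null
      - name: host_id
        tests:
          - relationships:
              to: ref('dim_hosts')
              field: host_id

2. A singular test, under tests/dim_listings_min_nights.sql:
-- Passes only when this query returns zero rows
SELECT *
FROM {{ ref('dim_listings') }}
WHERE minimum_nights < 1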
What if we want custom, reusable test functions that can be used under schema.yml? I am glad you asked ^^. The feature below supports it.
Macros, Custom Generic Tests and Packages
Macros in Jinja are pieces of code that can be reused multiple times – they are analogous to "functions" in other programming languages and are extremely useful if you find yourself repeating code across multiple models. Macros are defined in .sql files, typically in your macros directory. Custom generic tests rely on this feature as well.
We can use it to write custom tests and apply them to either singular tests or generic tests. That's the beauty of dbt!
Singular tests
Generic tests
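A minimal sketch of both flavours (the test, macro, and column names are my own assumptions):

Under macros/positive_value.sql:
{% test positive_value(model, column_name) %}
SELECT *
FROM {{ model }}
WHERE {{ column_name }} < 1   -- the test fails if this returns any rows
{% endtest %}

Applied as a generic test inside models/schema.yml:
      - name: minimum_nights
        tests:
          - positive_value

A plain macro defined with {% macro %} ... {% endmacro %} can likewise be called from a singular test file under tests/.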
More details can be found in the dbt documentation.
3. Packages
We can install any dbt package to extend the functionality of dbt. Specify packages under ./packages.yml and install them with
dbt deps
For example, the resource https://hub.getdbt.com/dbt-labs/dbt_utils/latest/ includes many built-in macros for tests.
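A minimal sketch (the version pin is only an illustration; check the hub page for the current release):

Under ./packages.yml:
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1   # illustrative pin

After dbt deps, its macros are available, for example as generic tests in schema.yml:
      - name: listing_id
        tests:
          - dbt_utils.at_least_one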
Source Freshness
An interesting feature of dbt is raising warnings/errors when source tables are not updated regularly.
This is defined under location [models/sources.yml] and checked with the cmd
dbt source freshness
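A minimal sketch under models/sources.yml (the source, table, and timestamp column names are assumptions):

version: 2
sources:
  - name: airbnb
    schema: raw
    tables:
      - name: reviews
        loaded_at_field: review_date   # timestamp dbt compares against the current time
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 24, period: hour}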
Snapshots
Snapshots auto-generate the tracking columns needed to keep SCD Type-2 history for a table:
For example, an Airbnb host updates his minimum nights of stay from 30 to 62. Instead of overwriting the old value, dbt simply adds one more row and lets us know which row is currently valid by auto-adding two columns, dbt_valid_from and dbt_valid_to.
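A minimal snapshot sketch (the schema, source, and column names are assumptions):

Under snapshots/scd_raw_listings.sql:
{% snapshot scd_raw_listings %}

{{
  config(
    target_schema='dev',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

SELECT * FROM {{ source('airbnb', 'listings') }}

{% endsnapshot %}

Each dbt snapshot run compares updated_at, closes the old row's dbt_valid_to, and opens a new row.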
Analysis, Hooks, and Exposures
Analyses under analysis/*.sql are SQL queries meant to feed BI tools; they are compiled by dbt but not materialized.
Hooks are SQL statements executed at predefined moments. They can be configured to run at different levels: project, directory, or model.
Hook types: on-run-start, on-run-end, pre-hook, and post-hook.
Example: grant the REPORTER role access to all tables under models, as sketched below.
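A minimal post-hook sketch in dbt_project.yml (the project and role names are placeholders; adjust the GRANT syntax to your warehouse):

models:
  airbnb_dbt:
    +post-hook:
      - "GRANT SELECT ON {{ this }} TO ROLE REPORTER"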
Exposures
Exposures link external dashboards to the documentation by configuring a dashboard.yml file, so that the dashboards show up on the docs site.
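A minimal sketch under models/dashboard.yml (the dashboard name, URL, and owner are placeholders):

version: 2
exposures:
  - name: executive_dashboard
    type: dashboard
    maturity: high
    url: https://bi.example.com/dashboards/airbnb   # placeholder URL
    description: Executive overview of listings and reviews
    depends_on:
      - ref('dim_listings')
      - ref('fct_reviews')
    owner:
      name: Data Team
      email: data@example.com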
3. Advanced Topics
Testing with Great Expectations
Great Expectations is a data quality testing framework (see the link in the Tests section above). Its ideas inspired dbt's test feature, and community packages such as dbt_expectations port many of its checks over to dbt.
Test functions can be set up together with a severity level (warn instead of error).
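For example, downgrading a test to a warning in schema.yml (a sketch of a column fragment; the column name is assumed):

      - name: review_sentiment
        tests:
          - not_null:
              config:
                severity: warn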
To debug tests, run
dbt --debug test (--select …)
Another, easier way is to locate the compiled SQL file that causes the error (dbt writes compiled SQL under the target/ directory) and run it directly in Snowflake to check.
Execution & Orchestration
dbt does not run queries on its own; it leverages the computing power of database engines (Snowflake, BigQuery, Redshift) or compute clusters (Databricks, Kubernetes).
For orchestration, it can be run under
1. Airflow
dbt Core: run as a BashOperator. Remember that we can run dbt on selected models to fit into an Airflow DAG ;)
dbt Cloud: use the dbt Cloud operator directly.
A reference can be found at https://www.getdbt.com/blog/dbt-airflow.
2. Databricks Workflow
dbt tasks can also be added to a Databricks Workflow, thanks to the collaboration between the two teams.
Finally, Best Practices!
Don't forget to follow DBT best practices https://docs.getdbt.com/guides/best-practices.
Talk to you soon!
Resources