DBT: Capture The In-house Data Flow

I was lucky enough to witness the transition from Airflow-scheduled SQL operators to Dbt at one of the companies I worked for, and I saw how Dbt streamlined the data flow. This article describes my experience with Dbt Core as a tool to build and monitor in-house data movements with SQL. It consists of three sections:

  1. Historical movements
  2. Dbt useful components
  3. Some advanced features

1. Historical movements

In this section, we first look at the role of Dbt in our Extract-Load-Transform (ELT) flow. Feel free to skip to Part 2 (DBT Useful Components) if you are only interested in the tech behind Dbt.

Let's assume we start out as a lumberjack trying to build up a wood store. Since the village is tiny at the start, we prefer preprocessing trees outdoors before transporting them to our warehouse. This scenario is similar to an ETL pipeline.

[Image: ETL]

Certainly, working outdoors can be dangerous and resources are limited. Once the village grows, we would rather bring trees back home, store them, and process them later. This allows our carpenters to serve different furniture-design needs without worrying about getting the first preprocessing step exactly right, and since the wood is cached, we can iterate quickly and fail fast.

[Image: ELT]

Now comes the scaling issue. If every house must have exactly one resident, it becomes troublesome when we want to store a huge amount of wood but process only a little, because many residents would sit idle. Vice versa, when we have a small amount of wood but want to process it fast by employing 1000 residents, we can't. This leads to the need to decouple houses and residents: let's have a separate storage house while keeping our group of residents flexible and on-demand.

[Image: Storage/compute decoupling]

And just like that, our village becomes a sawmill in no time, thanks to advancements in:

  • Cheap storage: allows more data to be stored, with redundancy.
  • Fast networks: enable data movement between storage and compute clusters.
  • Powerful compute: decoupling storage and compute encourages the growth of compute powerhouses such as Databricks, BigQuery, etc.

But, where does DBT stand in the whole pipeline?

[Image: Modern data stack]

The short answer is: Dbt manages and triggers SQL runs against your database, warehouse, lake, or query engine.

The longer answer is: Dbt manages your SQL scripts in a well-organized fashion and goes beyond that, covering Medallion data models, materialization, data quality checks, and documentation. Its integration with Airflow and Databricks enables active development in modern data systems.

Let's have a look at the next section to understand more.

2. DBT Useful Components

[Image: DBT components (key components highlighted in red)]

If you're a carpenter constructing a house, DBT is the blueprint managing assistant.

  • DBT keeps track of all blueprints as code, so you can have versioning.
  • It manages the order of execution stages, like building the foundation, the frame, walls, doors, etc. [equivalent to Medallion data layers]. Note that DBT only triggers the work; the residents (your warehouse or compute engine) actually execute it.
  • For each house component, you can specify the building flow, and DBT triggers the execution in the exact order.
  • It enables quality checks on each component.
  • And many other features are under development.

Let's explore each component by building up an Airbnb sample data flow.

Documentation (Interface)

The promise of Dbt is that you get a GUI to view the constructed data flow from the start, and it is progressively updated as you build. It includes a nice view of the overall DAG and a GUI (similar to Airflow, right!) that describes your tables and dependencies.


This is set up under locations [models/docs.md, schema.yml] and compiled by these command lines:

dbt docs generate
dbt docs serve        

You can add descriptions or docs for models under models/docs.md and reference them in models/schema.yml like this.

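To make this concrete, here is a minimal sketch of what such a docs block and its reference could look like (the model and column names are illustrative, loosely following the Airbnb example used throughout):

models/docs.md

{% docs dim_listings_minimum_nights %}
Minimum number of nights required to book the listing. A value of 0 in the source is treated as 1.
{% enddocs %}

models/schema.yml

models:
  - name: dim_listings_cleansed
    description: Cleansed listings, one row per listing.
    columns:
      - name: minimum_nights
        description: '{{ doc("dim_listings_minimum_nights") }}'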

Model (Blueprints) - The Backbone

Suppose you want to build the Airbnb data flow below where tables/views are derived from one another.

[Image: Data flow example]

The steps are

1. Import data sources in raw layers into your DB. Here I use Snowflake.


2. Set up Dbt Core (locally), initialize the project, and connect it to Snowflake.

3. Under the models directory, add stages and internal SQL scripts to create each component.

[Image: SQL scripts under dbt]

This is where you define the blocks and connections of your flow with SQL scripts and placeholders. We add references to other tables inside the SQL scripts, which is how DBT constructs the dependency graph we need.
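For illustration, a cleansing model could look like the sketch below (the src_hosts model and its columns are hypothetical; the important part is the {{ ref() }} placeholder, which is how dbt learns the dependencies between models):

models/dim/dim_hosts_cleansed.sql

with src_hosts as (
    -- dbt resolves ref() to the upstream model and records the dependency
    select * from {{ ref('src_hosts') }}
)
select
    host_id,
    nvl(host_name, 'Anonymous') as host_name,   -- Snowflake NVL; use coalesce elsewhere
    is_superhost,
    created_at,
    updated_at
from src_hosts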

To run individual models, or to include upstream and downstream models, use the following commands:

dbt run --select my_model     # run only my_model
dbt run --select +my_model    # run my_model plus all upstream models it depends on
dbt run --select my_model+    # run my_model plus all downstream models that depend on it

To summarize, data models are defined under locations [models/layer_name/*.sql]. To compile, the following cmd generates executable SQL from your model, test, and analysis files.

dbt compile        

Reference link.

Slowly changing dimensions (SCD)

To reason about these models, we need to understand the SCD concept. Basically, it classifies how each data model handles changes in the source data:

  • SCD Type 0: no change in the DWH; updates happen only in source tables.
  • SCD Type 1: overwrite the old value with the new one; no history is kept.
  • SCD Type 2: keep all history by adding new rows.
  • SCD Type 3: keep only one previous record by adding additional columns.

As the descriptions suggest, this concept is super useful when we construct a flow of data models:

  1. SCD1: used when we want to refresh the whole data model to sync with source data. It is good when the data is small, is rarely-changed dimensional data (e.g. a dictionary of country codes and country details), or is simply inexpensive to fully refresh.
  2. SCD2: useful when we want to incrementally load transactional data (e.g. orders in an e-commerce platform). It would be too expensive to fully refresh, or perhaps the historical data source is no longer available.

Materialization

Dbt supports different types of materialization for each SQL model you create.

  • View
  • Table
  • Incremental (table appends): use it for incremental loads.
  • Ephemeral (CTE): if you find yourself aggregating a CTE, split it into a separate ephemeral model. You can still run tests on it.

You can specify materialization for a group of tables in the yml config file or at the beginning of each SQL file (which overrides the common config in the yml file).

[Image: Materialization under project config]
[Image: Materialization under SQL]

To summarize, materialization is defined under locations [models/*.sql, dbt_project.yml].
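As a sketch, an incremental model could be configured like this (table and column names are illustrative; the is_incremental() macro guards the filter so it only applies on incremental runs, not on the initial full build):

models/fct/fct_reviews.sql

{{ config(materialized='incremental', on_schema_change='fail') }}

with src_reviews as (
    select * from {{ ref('src_reviews') }}
)
select *
from src_reviews
where review_text is not null
{% if is_incremental() %}
  -- only pick up rows newer than what is already in the target table
  and review_date > (select max(review_date) from {{ this }})
{% endif %}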

Tests

This is one of the strong points of Dbt, which is inspired by Great Expectations (https://docs.greatexpectations.io/docs).

The general command line to run all the tests, or for selected tables, is

dbt test (--select table)        

To write tests, there are two options

  1. Generic tests: locations [models/schema.yml]

Built-in functions to test the properties of columns in tables.

Examples: not_null, unique, accepted_values, relationships

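A minimal schema.yml with built-in generic tests could look like this (model names, columns, and accepted values are illustrative):

models/schema.yml

models:
  - name: dim_listings_cleansed
    columns:
      - name: listing_id
        tests:
          - unique
          - not_null
      - name: host_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed')
              field: host_id
      - name: room_type
        tests:
          - accepted_values:
              values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']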

  2. Singular tests: locations [tests/*.sql]

Custom SQL tests that pass if they return no rows.

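For example, a singular test could assert that no listing allows fewer than one minimum night (file, model, and column names are illustrative):

tests/dim_listings_minimum_nights.sql

-- The test passes when this query returns zero rows
select *
from {{ ref('dim_listings_cleansed') }}
where minimum_nights < 1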

What if we want custom, function-like tests that can be used in schema.yml? I am glad that you asked ^^. The feature below supports exactly that.

Macros, Custom Generic Tests and Packages

  1. Macro & Custom generic tests: locations [macros/*.sql]

Macros in Jinja are pieces of code that can be reused multiple times – they are analogous to "functions" in other programming languages and are extremely useful if you find yourself repeating code across multiple models. Macros are defined in .sql files, typically in your macros directory. Custom generic tests rely on this feature as well.

We can use macros to write custom tests and apply them as either singular tests or generic tests. That's the beauty of dbt!
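As a sketch, a reusable macro could check every column of a model for NULLs (the macro name is illustrative; adapter.get_columns_in_relation is a standard dbt adapter call):

macros/no_nulls_in_columns.sql

{% macro no_nulls_in_columns(model) %}
    select * from {{ model }} where
    {% for col in adapter.get_columns_in_relation(model) %}
        {{ col.column }} is null or
    {% endfor %}
    false
{% endmacro %}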

Singular tests

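Used as a singular test, the macro above is simply called from a file under tests/ (names are illustrative; the test passes when the generated query returns no rows):

tests/no_nulls_in_dim_listings.sql

{{ no_nulls_in_columns(ref('dim_listings_cleansed')) }}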

Generic tests

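A custom generic test is a macro defined with a test block; once defined, it can be attached to any column in schema.yml like a built-in test (test, model, and column names are illustrative):

macros/positive_value.sql

{% test positive_value(model, column_name) %}
select *
from {{ model }}
where {{ column_name }} < 1
{% endtest %}

models/schema.yml

  - name: dim_listings_cleansed
    columns:
      - name: minimum_nights
        tests:
          - positive_value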

More details can be found here.

  3. Packages

We can install dbt packages to extend dbt's functionality. Specify them under ./packages.yml and install with

dbt deps        

For example, dbt_utils (https://hub.getdbt.com/dbt-labs/dbt_utils/latest/) includes many ready-made macros and tests.
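A minimal packages.yml could look like this (the pinned version is illustrative; check the hub page for the current release):

packages.yml

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1

After running dbt deps, macros such as dbt_utils.generate_surrogate_key and the dbt_utils generic tests become available in your models and schema.yml.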

Source Freshness

An interesting feature of dbt is raising warnings/errors when source tables are not updated regularly.


This is defined under location [models/sources.yml] and checked with the cmd

dbt source freshness        
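A freshness block in models/sources.yml could look like the sketch below (source, table, and thresholds are illustrative; loaded_at_field tells dbt which timestamp column to check against):

models/sources.yml

sources:
  - name: airbnb
    schema: raw
    tables:
      - name: reviews
        identifier: raw_reviews
        loaded_at_field: date
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 24, period: hour}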

Snapshots

Snapshots auto-generate tracking columns for SCD Type 2 history on a table:

  • dbt_valid_from: timestamp
  • dbt_valid_to: timestamp, NULL for the current record

For example, an Airbnb host updates their minimum nights of stay from 30 to 62. With incremental loading, we simply add one more row, and dbt tells us which row is current by auto-adding the two columns dbt_valid_from and dbt_valid_to.

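A snapshot lives in its own .sql file wrapped in a snapshot block; here is a minimal sketch using the timestamp strategy (schema, unique key, and column names are illustrative):

snapshots/scd_raw_listings.sql

{% snapshot scd_raw_listings %}

{{
    config(
        target_schema='dev',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes=True
    )
}}

select * from {{ source('airbnb', 'listings') }}

{% endsnapshot %}

Running dbt snapshot builds the history table and maintains dbt_valid_from and dbt_valid_to automatically.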

Analysis, Hooks, and Exposures

Analyses under analysis/*.sql are queries whose compiled SQL can feed BI tools; they are not materialized.

Hooks are SQL statements that are executed at predefined moments. They can be configured to run at different levels: project, directory, or model.

Hook types:

  • pre-hook: executed before a model, seed or snapshot is built.
  • post-hook: executed after a model, seed or snapshot is built.
  • on-run-start: executed at the start of dbt run, dbt test, dbt seed or dbt snapshot.
  • on-run-end: executed at the end of dbt run, dbt test, dbt seed or dbt snapshot.

Example: Grant the Reporter access to all tables under models.

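At the project level, such a grant could be expressed as a post-hook in dbt_project.yml (the project name and the REPORTER role are placeholders; {{ this }} resolves to the table or view that was just built):

dbt_project.yml

models:
  my_dbt_project:
    +post-hook:
      - "GRANT SELECT ON {{ this }} TO ROLE REPORTER"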

Exposures

Exposures link external dashboards to the documentation: configure a dashboard.yml and the dashboard will be shown on the docs site.

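A minimal exposure definition could look like this (every name, the URL, and the owner are placeholders):

models/dashboard.yml

exposures:
  - name: executive_dashboard
    type: dashboard
    maturity: low
    url: https://my-bi-tool.example.com/dashboards/1
    description: Executive KPI dashboard built on top of the mart models.
    depends_on:
      - ref('fct_reviews')
    owner:
      name: Data Team
      email: data-team@example.com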


3. Advanced Topics

Testing with Great Expectations

Great Expectations is a data quality testing framework (link). It inspired a port to dbt (link) for the test feature.

Test functions can be combined with a severity level (e.g. warn instead of error).

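As a sketch, assuming the dbt_expectations package is installed, a test with warn severity could be configured like this (model, column, and value set are illustrative):

models/schema.yml

  - name: fct_reviews
    columns:
      - name: review_sentiment
        tests:
          - dbt_expectations.expect_column_values_to_be_in_set:
              value_set: ['positive', 'neutral', 'negative']
              config:
                severity: warn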

To debug tests, run

dbt --debug test (--select …)         

An easier alternative is to locate the compiled SQL file that causes the error and run it directly in Snowflake to check.

Execution & Orchestration

Dbt does not compute on its own; it leverages the computing power of database engines (Snowflake, BigQuery, Redshift) or compute clusters (Databricks, Kubernetes).

For orchestration, it can be run under

  1. Airflow

Dbt Core: run as a BashOperator. Remember that we can run dbt on selected models to fit into an Airflow DAG ;)

Dbt Cloud: use the Dbt Cloud operator directly (link).


Reference can be found here https://www.getdbt.com/blog/dbt-airflow.

2. Databricks Workflow

Dbt can also be added to a Databricks Workflow, thanks to the collaboration between the two teams.


Finally, Best Practices!

Don't forget to follow DBT best practices https://docs.getdbt.com/guides/best-practices.

Talk to you soon!


Resources

  1. Dbt best practices (link)
  2. Udemy: The Complete dbt (Data Build Tool) Bootcamp: Zero to Hero (link)

