DuckDB x dbt, make Psyduck great again!

In this article, we will explore the motivation behind this project as well as how I decided to tackle its implementation.

1. Context

A key feature of my web application is the comprehensive dashboarding for Pokemon sets (e.g., Base Set, Scarlet & Violet – 151) and Pokemon blocks. These dashboards incorporate various charts (Total Sales, Top Price, Monthly Sales, etc.) and Key Performance Indicators (KPIs), as illustrated in Figure 1.1.

figure 1.1 - Example of a Pokemon Set Dashboard

2. Problem

To understand the problem, let's examine the data retrieval process between the frontend and backend in both the previous and current implementations.

Previous Architecture:

When a user accessed a dashboard, the following process occurred:

  1. The frontend sent a request to the backend.
  2. The backend queried the database, which had to execute all the queries needed for the dashboard (including numerous computationally expensive aggregations), return the resulting tables to the backend, and store them in the cache.

This approach was inefficient: the data is only refreshed once a day, so these expensive aggregations were recomputed repeatedly for no benefit (refer to Figure 2.1).

figure 2.1 - Previous architecture, where the OLTP database was handling heavy analytical workloads.

Optimized Architecture:

In the new implementation:

  • When a user accesses a dashboard, the database returns only pre-calculated tables specific to that dashboard, performing read operations without recalculating data.
  • A dockerized transformation pipeline is triggered during daily data insertion, materializing dbt models (SQL) as database tables.
  • DuckDB was chosen to perform all calculations for these tables, given that a cloud data warehouse is not in use.
  • PostgreSQL serves as the OLTP database, efficiently handling numerous small queries with minimal calculations.

This optimized approach is made possible through the synergy of dbt and DuckDB (I always knew Psyduck would be helpful at least ONE time).

figure 2.2 - New architecture implementing a daily refresh of the dashboard tables for each block & set.

3. Implementation

Note:

  • This section does not cover the triggering mechanism of the data pipeline. It is scheduled to run at a specific time when new data has been inserted into the database, checking data freshness and triggering the pipeline if conditions are met.
  • The implementation described here is one of many possible approaches, each with its own strengths and limitations. These will be discussed later in the article (see Advantages & Limitations section).

A. Overview

The data pipeline workflow consists of the following steps:

  1. Establish a connection to the DuckDB database and import new data from PostgreSQL.
  2. Perform a git pull in the dbt project directory to retrieve the latest changes; if the project doesn't exist locally, clone it directly from GitHub. Check out the "prod" branch.
  3. Execute dbt debug to verify the configuration (profiles, dbt_project, and connection to DuckDB).
  4. Run "dbt build --target duckdb" to: (a) execute the models, (b) test the models, (c) load base data (seeds, optional), and (d) take a data snapshot (optional).
  5. Use materialized models (tables) in DuckDB to update PostgreSQL tables.

This workflow includes comprehensive testing (not detailed above).
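The following sections walk through each step with the project's own code (shown as figures). As a high-level reference only, here is a minimal, hypothetical Airflow DAG skeleton showing how these five steps could be wired together. Every name below (DAG id, task ids, helper functions) is an assumption rather than the project's actual code, and the dbt target name follows the detailed section later in the article.

# Hypothetical DAG skeleton wiring the five steps together (all names are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

DBT_DIR = "/opt/airflow/dbt_project"  # volume mounted in docker-compose.yml


def update_duckdb_from_postgres() -> None:
    """Step 1: import the new Postgres rows into DuckDB (sketched later in the article)."""


def clone_or_pull_dbt_project() -> None:
    """Step 2: clone or pull the dbt project and check out the 'prod' branch."""


def update_postgres_from_duckdb() -> None:
    """Step 5: push the materialized DuckDB tables back to Postgres."""


with DAG(
    dag_id="dashboard_tables_refresh",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    import_data = PythonOperator(
        task_id="import_new_data_into_duckdb",
        python_callable=update_duckdb_from_postgres,
    )
    sync_project = PythonOperator(
        task_id="clone_or_pull_dbt_project",
        python_callable=clone_or_pull_dbt_project,
    )
    dbt_debug = BashOperator(
        task_id="dbt_debug",
        bash_command=f"cd {DBT_DIR} && dbt debug --target prod",
    )
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command=f"cd {DBT_DIR} && dbt build --target prod",
    )
    update_pg = PythonOperator(
        task_id="update_postgres_tables",
        python_callable=update_postgres_from_duckdb,
    )

    import_data >> sync_project >> dbt_debug >> dbt_build >> update_pg

The real DAG would also include the freshness checks, validation tasks, and Slack alerting described below.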

B. Docker Environment Setup

This pipeline only needs to run once a day, so a Docker container that runs only for the duration of the pipeline is the natural choice. A basic docker-compose.yml setup was used (based on the Airflow-provided configuration with custom modifications). Volumes were added to keep data persistent on the host machine, though this is optional.

volumes:
    - ${AIRFLOW_PROJ_DIR:-.}:/opt/airflow
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
    - ${AIRFLOW_PROJ_DIR:-.}/dbt_project:/opt/airflow/dbt_project
    - ${AIRFLOW_PROJ_DIR:-.}/duckdb:/opt/airflow/duckdb
    - ${AIRFLOW_PROJ_DIR:-.}/functions:/opt/airflow/plugins/functions
        

On the other side, the Dockerfile is as follows:

# Use the official Airflow image
FROM apache/airflow:latest

USER root

# Install git and other dependencies
RUN apt-get update && apt-get install -y \
    git \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Set permissions on Airflow directories and other directories as needed
RUN chmod -R 777 /opt/airflow

USER airflow

# Copy the requirements file
COPY requirements.txt /opt/airflow/

# Install the dependencies
RUN pip install -r /opt/airflow/requirements.txt

# Copy the .env file
COPY .env /opt/airflow/.env

# Set the environment variables
ENV AIRFLOW_HOME=/opt/airflow
        

The packages listed in requirements.txt are: python-dotenv, duckdb, dbt-core==1.8.2, and dbt-duckdb==1.8.1.

C. Detailed Implementation Steps

1. Updating Data in DuckDB

This step involves importing new data from PostgreSQL that is not yet present in DuckDB. The process is as follows:

figure 3.1 - establishing a connection to the duckdb database.

The pipeline first attempts to establish a connection. If it succeeds, the Postgres database is "attached" to DuckDB.

figure 3.2 - Attaching the postgres database to duckdb.
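The figures above show the project's own code. As a rough sketch, connecting to DuckDB and attaching Postgres with the duckdb Python package can look like this; the connection string and environment variable name are placeholders, while the database path matches the one visible in the dbt debug logs further down.

import os

import duckdb

DUCKDB_PATH = "/opt/airflow/duckdb/ebay_sales_db.duckdb"  # same file the dbt profile points to
# Placeholder DSN, e.g. provided through the .env file loaded with python-dotenv.
PG_DSN = os.environ.get("POSTGRES_DSN", "dbname=ebay_sales host=postgres user=airflow")

# Open (or create) the local DuckDB database file.
con = duckdb.connect(DUCKDB_PATH)

# Load the Postgres extension and attach the OLTP database so its tables
# can be queried from DuckDB under the "pg" alias.
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute(f"ATTACH '{PG_DSN}' AS pg (TYPE POSTGRES)")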

But what does "attach" actually mean?

According to the official DuckDB documentation:

The ATTACH statement adds a new database file to the catalog that can be read from and written to.

The next step consists of updating the data in DuckDB. However, several scenarios can occur, and each of them needs to be taken into account so that the pipeline does not fail.


The most common scenarios I decided to cover are:

  • S1: DuckDB data is older than PostgreSQL data → update DuckDB with the new data from PostgreSQL (expected scenario)
  • S2: DuckDB database is empty → load all data from PostgreSQL into DuckDB
  • S3: DuckDB and PostgreSQL data freshness are equal → refresh the data
  • S4: DuckDB data is newer than PostgreSQL → abort the pipeline and send Slack notifications with logs

Here is the code handling these scenarios:

figure 3.3 - S1 : updating duckdb db with new data from postgres.


figure 3.4 - managing S2 (empty duckdb database)
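The figures above contain the actual implementation. To make the logic concrete, here is a simplified, hypothetical version of the freshness check covering the four scenarios; the table name (sales), timestamp column (sale_date), and Postgres schema are assumptions, and "con" is the DuckDB connection with Postgres attached as "pg".

def refresh_duckdb(con, table: str = "sales", ts_col: str = "sale_date") -> None:
    """Sync one table from the attached Postgres database into DuckDB."""
    duck_tables = {row[0] for row in con.execute("SHOW TABLES").fetchall()}

    # S2: the table does not exist yet in DuckDB -> load everything from Postgres.
    if table not in duck_tables:
        con.execute(f"CREATE TABLE {table} AS SELECT * FROM pg.public.{table}")
        return

    duck_max = con.execute(f"SELECT max({ts_col}) FROM {table}").fetchone()[0]
    pg_max = con.execute(f"SELECT max({ts_col}) FROM pg.public.{table}").fetchone()[0]

    if pg_max is None:
        return  # Postgres is empty: nothing to sync

    if duck_max is None:
        # The DuckDB table exists but is empty -> treat it like S2 and reload everything.
        con.execute(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM pg.public.{table}")
    elif duck_max < pg_max:
        # S1 (expected case): DuckDB lags behind -> insert only the missing rows.
        con.execute(
            f"INSERT INTO {table} SELECT * FROM pg.public.{table} WHERE {ts_col} > ?",
            [duck_max],
        )
    elif duck_max == pg_max:
        # S3: same freshness -> refresh the table anyway.
        con.execute(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM pg.public.{table}")
    else:
        # S4: DuckDB is newer than Postgres -> abort the pipeline.
        raise RuntimeError("DuckDB data is newer than Postgres data, aborting pipeline")

In the real pipeline, S4 also triggers the Slack notification mentioned above rather than only raising an exception.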

2. Setting up & running the dbt project

This step involves:

  a. Cloning the dbt project from GitHub if it doesn't exist, or performing a git pull to fetch new features.
  b. Checking out the "prod" branch.
  c. Setting up the dbt profile (Figure 3.6).
  d. Running dbt debug to verify the configuration.
  e. Executing "dbt build --target prod" to build all models.

Note: To address permission issues in the Docker environment (Linux), specific git and Linux commands were used (e.g., git config --global --add safe.directory dbt_path).

figure 3.5 - cloning/pulling dbt_project
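As an illustration of what this clone-or-pull logic can look like (the repository URL is a placeholder and the directory is the volume mounted in docker-compose.yml):

import os
import subprocess

DBT_DIR = "/opt/airflow/dbt_project"                   # volume mounted in docker-compose.yml
REPO_URL = "https://github.com/<user>/<dbt_repo>.git"  # placeholder repository URL


def clone_or_pull_dbt_project() -> None:
    """Clone the dbt project if it is missing, otherwise pull the latest changes."""
    # Avoid git's "dubious ownership" error inside the container (see the note above).
    subprocess.run(
        ["git", "config", "--global", "--add", "safe.directory", DBT_DIR], check=True
    )
    if os.path.isdir(os.path.join(DBT_DIR, ".git")):
        subprocess.run(["git", "-C", DBT_DIR, "pull"], check=True)
    else:
        subprocess.run(["git", "clone", REPO_URL, DBT_DIR], check=True)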

The next step consists of checking out the "prod" branch of the project. It is pretty basic: we simply run a "git checkout prod".

It is also necessary to set up the dbt profile. I created it in a dedicated task, but this is not strictly necessary: you can simply ship the profiles.yml inside the dbt project so it is already there when you clone it.

figure 3.6 - setting up the dbt profiles.yml file
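The figure shows the project's actual profile. For reference, a task writing a minimal dbt-duckdb profiles.yml could look like the sketch below; the top-level profile name is an assumption, while the target, path, and schema mirror the dbt debug output shown further down.

from pathlib import Path
from textwrap import dedent

DBT_DIR = Path("/opt/airflow/dbt_project")  # volume mounted in docker-compose.yml


def write_dbt_profile() -> None:
    """Write a minimal dbt-duckdb profiles.yml into the project directory."""
    # The top-level key must match the profile name declared in dbt_project.yml
    # (assumed here to be "dbt_project"). dbt picks up profiles.yml from the
    # current working directory or from ~/.dbt/.
    profile = dedent(
        """\
        dbt_project:
          target: prod
          outputs:
            prod:
              type: duckdb
              path: /opt/airflow/duckdb/ebay_sales_db.duckdb
              schema: main
              threads: 4
        """
    )
    (DBT_DIR / "profiles.yml").write_text(profile)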

Finally comes the time to build all the models. Running dbt debug first is mandatory to check that the dbt configuration (database connection + profile) is correct.

Then we simply run "dbt build --target prod" to build all the models.
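As a hedged sketch, assuming the two commands are launched from a Python task via subprocess (the helper name and working directory are illustrative), whatever the task prints ends up in the Airflow logs shown below:

import subprocess

DBT_DIR = "/opt/airflow/dbt_project"  # volume mounted in docker-compose.yml


def run_dbt(args: str) -> None:
    """Run a dbt command and surface its output in the Airflow task logs."""
    result = subprocess.run(
        ["dbt", *args.split()], cwd=DBT_DIR, capture_output=True, text=True
    )
    print(result.stdout)  # printed output ends up in the Airflow task logs
    if result.returncode != 0:
        raise RuntimeError(f"dbt {args} failed:\n{result.stderr}")


run_dbt("debug --target prod")
run_dbt("build --target prod")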

Here are the Airflow logs of this task:

10:34:36  Configuration:
10:34:36    profiles.yml file [OK found and valid]
10:34:36    dbt_project.yml file [OK found and valid]
10:34:36  Required dependencies:
10:34:36   - git [OK found]
10:34:36  Connection:
10:34:36    database: ebay_sales_db
10:34:36    schema: main
10:34:36    path: /opt/***/duckdb/ebay_sales_db.duckdb
10:34:36    config_options: None
10:34:36    extensions: None
10:34:36    settings: None
10:34:36    external_root: .
10:34:36    use_credential_provider: None
10:34:36    attach: None
10:34:36    filesystems: None
10:34:36    remote: None
10:34:36    plugins: None
10:34:36    disable_transactions: False
10:34:36  Registered adapter: duckdb=1.8.1
10:34:36    Connection test: [OK connection ok]
10:34:36  All checks passed!
[2024-06-21, 10:41:10 UTC] {dbt_duckdb.py:212} INFO - DBT build output:
10:34:38  Running with dbt=1.8.2
10:34:38  Registered adapter: duckdb=1.8.1
10:34:58  Found 545 models, 2449 data tests, 3 sources, 793 macros
building the models....
.....
.....
10:41:08  Finished running 2449 data tests, 7 view models, 538 table models in 0 hours 6 minutes and 10.31 seconds (370.31s).
        

3. Updating tables in the Postgres database

To ensure the safe and accurate update of PostgreSQL tables, a comprehensive validation process is implemented. This process involves a series of tests to verify that the models created in DuckDB align with expectations. These tests encompass various aspects, including:

  1. Row count verification
  2. Model name consistency
  3. Data type matching
  4. Key field validation
  5. Data integrity checks

This rigorous testing phase effectively serves as a source-to-target mapping, ensuring that all data is properly prepared and validated before any updates are made to the PostgreSQL database.
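As a rough illustration (not the project's actual code), two of these checks, row count and column-name consistency, could be expressed against the DuckDB connection as follows; the "pg" alias and the public schema are assumptions carried over from the attach step.

import duckdb


def validate_model(con, model: str, pg_schema: str = "public") -> None:
    """Sketch of two pre-update checks: row count and column-name consistency."""
    # 1. The freshly built model must not be empty.
    n_rows = con.execute(f"SELECT count(*) FROM {model}").fetchone()[0]
    if n_rows == 0:
        raise ValueError(f"model {model} is empty in DuckDB")

    # 2. Column names must match the existing Postgres table (queried through "pg").
    duck_cols = [d[0] for d in con.execute(f"SELECT * FROM {model} LIMIT 0").description]
    try:
        pg_cols = [
            d[0]
            for d in con.execute(f"SELECT * FROM pg.{pg_schema}.{model} LIMIT 0").description
        ]
    except duckdb.Error:
        return  # the table does not exist in Postgres yet: it will be created from scratch
    if duck_cols != pg_cols:
        raise ValueError(f"column mismatch for {model}: {duck_cols} vs {pg_cols}")

The real pipeline extends this with data-type matching, key-field validation, and integrity checks.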

Following successful validation, the data update process is initiated. This process follows a two-pronged approach:

  1. For non-existent tables in PostgreSQL, new tables are created and populated with data from the corresponding DuckDB models.
  2. For existing tables, a replace operation is performed, where the current data in PostgreSQL is overwritten with the fresh, validated data from DuckDB.

This methodology ensures that the PostgreSQL database is always synchronized with the most current and accurate data, while maintaining data integrity throughout the entire process.

It is done as follows:

figure 3.7 - updating postgres data
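The figure contains the project's implementation. One possible way to get this create-or-replace behaviour, assuming the Postgres database is attached writably under the "pg" alias, is to let DuckDB write straight into Postgres:

def push_model_to_postgres(con, model: str, pg_schema: str = "public") -> None:
    """Recreate the Postgres table from the freshly built DuckDB model of the same name."""
    # "pg" is the alias under which the Postgres database was attached to DuckDB.
    # Dropping first covers existing tables; the CREATE covers tables that don't exist yet.
    con.execute(f"DROP TABLE IF EXISTS pg.{pg_schema}.{model}")
    con.execute(f"CREATE TABLE pg.{pg_schema}.{model} AS SELECT * FROM {model}")

A caveat of a plain drop-and-recreate is that the dashboard could briefly see a missing table; writing to a staging table and swapping it in avoids that window.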

4. Advantages & Limitations

a) Advantages

This architecture brings several benefits to the application, ranging from the separation of transactional and analytical workloads to the automation of daily data refresh processes.

  • Separation of Transactional and Analytical Workloads: Postgres serves as the backbone of the application, holding all Pokemon-related data (sales data, dashboard data, etc.). The database is frequently queried for small transactions (primarily reads), functioning as an OLTP database not designed for heavy analytical workloads. Utilizing DuckDB, an OLAP database, to run analytical queries allows Postgres to remain more available, thereby increasing performance and reducing bottlenecks.
  • Data Reliability: dbt offers the capability to test materialized models, which enhances data quality (non-null values, unique values, date formats, column types, etc.). Consequently, when these tables are utilized by the application, issues such as "undefined values" or encoding problems are mitigated.
  • Automated Daily Data Refresh: Leveraging Airflow as a job orchestrator, the application becomes fully autonomous: new data is inserted into the database daily, and dashboard tables are refreshed concurrently. This provides users with a seamless experience while allowing developers to focus on other priorities.
  • Scalability and Flexibility: The modular nature of this architecture allows for easy scaling of individual components as the application grows. It also provides the flexibility to swap out components if needed, such as replacing DuckDB with a different OLAP solution in the future.

b) Limitations

  • Increased Complexity: This architecture introduces two new tools: DuckDB and dbt. While they bring significant benefits, they also increase the complexity of the data stack, potentially introducing more points of failure. Postgres could have been used as the query engine with dbt, but this would have diminished the project's innovative aspect and performance benefits.
  • Learning Curve: Implementing and maintaining this architecture requires familiarity with multiple tools and technologies, which may necessitate additional training or expertise.
  • Resource Overhead: Running multiple systems (Postgres, DuckDB, dbt, Airflow) requires more computational resources compared to a simpler, monolithic approach.

5. Conclusion

This project demonstrates the potential of combining dbt and DuckDB for medium-sized data use cases. The architecture showcases an efficient, on-premises solution that balances performance, reliability, and automation.

While the current implementation is fully on-premise (with data file replication in GCP buckets for redundancy), there is potential for further enhancement through a cloud-native approach. This could involve migrating all data from Postgres to an object storage solution with an appropriate table format such as Apache Iceberg, which DuckDB can query efficiently. This evolution would effectively create a data lakehouse architecture.

In this enhanced scenario, all models would be materialized by dbt in the data lake, and the application would read from the cloud-based data lakehouse instead of the on-premise Postgres database. This approach would offer improved scalability, cost-effectiveness, and alignment with modern data engineering practices.

The project underscores the importance of choosing the right tools for specific data challenges and demonstrates how innovative combinations of technologies can yield significant benefits in data processing and analysis workflows.
