DuckDB x dbt: make Psyduck great again!
In this article, we will explore the motivation behind this project as well as how I decided to tackle its implementation.
1. Context
A key feature of my web application is the comprehensive dashboarding for Pokemon sets (e.g., Base Set, Scarlet & Violet – 151) and Pokemon blocks. These dashboards incorporate various charts (Total Sales, Top Price, Monthly Sales, etc.) and Key Performance Indicators (KPIs), as illustrated in Figure 1.
2. Problem
To understand the issue, let's examine the data retrieval process between the frontend and backend, in both the previous and the current implementation.
Previous Architecture:
When a user accessed a dashboard, the following process occurred:
This approach was inefficient: since the underlying data only refreshes daily, the same calculations were repeated on every dashboard load (refer to Figure 2.1).
Optimized Architecture:
In the new implementation:
This optimized approach is made possible through the synergy of dbt and DuckDB (I always knew Psyduck would be helpful at least ONE time).
3. Implementation
Note:
A. Overview
The data pipeline workflow consists of the following steps:
This workflow includes comprehensive testing (not detailed above).
B. Docker Environment Setup
This pipeline only needs to be executed daily. Therefore, a Docker container that runs only for the duration of the pipeline is the go-to solution. A basic docker-compose.yml setup was used (based on the Airflow-provided configuration with custom modifications). Volumes were added to maintain persistent data on the host machine, though this is optional.
volumes:
- ${AIRFLOW_PROJ_DIR:-.}:/opt/airflow
- ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
- ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
- ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
- ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
- ${AIRFLOW_PROJ_DIR:-.}/dbt_project:/opt/airflow/dbt_project
- ${AIRFLOW_PROJ_DIR:-.}/duckdb:/opt/airflow/duckdb
- ${AIRFLOW_PROJ_DIR:-.}/functions:/opt/airflow/plugins/functions
On the other side, the Dockerfile is as follows:
# Use the official Airflow image
FROM apache/airflow:latest
USER root
# Install git and other dependencies
RUN apt-get update && apt-get install -y \
    git \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
# Set permissions on Airflow directories and other directories as needed
RUN chmod -R 777 /opt/airflow
USER airflow
# Copy the requirements file
COPY requirements.txt /opt/airflow/
# Install the dependencies
RUN pip install -r /opt/airflow/requirements.txt
# Copy the .env file
COPY .env /opt/airflow/.env
# Set the environment variables
ENV AIRFLOW_HOME=/opt/airflow
The packages are: python-dotenv, duckdb, dbt-core==1.8.2 and dbt-duckdb==1.8.1.
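For reference, the requirements.txt copied in the Dockerfile above then simply lists these packages (any extra version pins beyond the two shown are left out here):

python-dotenv
duckdb
dbt-core==1.8.2
dbt-duckdb==1.8.1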
C. Detailed Implementation Steps
1. Updating Data in DuckDB
This step involves importing new data from PostgreSQL that is not yet present in DuckDB. The process is as follows:
A connection attempt is first made. If it succeeds, we "attach" the Postgres database to DuckDB.
But what does "attach" actually mean?
According to the official DuckDB documentation:
The ATTACH statement adds a new database file to the catalog that can be read from and written to.
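Concretely, inside the Airflow task this step looks roughly like the sketch below. It is a minimal example only: the environment variable names and the pg_db alias are placeholders of mine, not the project's exact code.

import os
import duckdb
from dotenv import load_dotenv

# Load the Postgres credentials shipped with the image via the .env file.
load_dotenv("/opt/airflow/.env")

# Open (or create) the local DuckDB file used by the pipeline.
con = duckdb.connect("/opt/airflow/duckdb/ebay_sales_db.duckdb")

# Make the Postgres database visible in DuckDB's catalog via ATTACH.
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute(
    "ATTACH 'dbname={db} host={host} user={user} password={pwd}' AS pg_db (TYPE postgres)".format(
        db=os.getenv("POSTGRES_DB"),
        host=os.getenv("POSTGRES_HOST"),
        user=os.getenv("POSTGRES_USER"),
        pwd=os.getenv("POSTGRES_PASSWORD"),
    )
)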
The next step consists of updating the data in DuckDB. However, several scenarios can occur, and it is important to take them into account so that the pipeline does not fail when one of them happens.
"test test test : rest rest rest"
The most common scenarios that I decided to cover are:
Here is the code implementing the handling of these scenarios:
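Since the exact scenarios depend on the project, here is only a simplified, hypothetical sketch of that kind of update logic; the id-based incremental condition and the table handling are assumptions of mine, not the project's actual code.

import duckdb

def update_duckdb_table(con: duckdb.DuckDBPyConnection, table: str) -> None:
    # Scenario 1: the table does not exist yet in DuckDB -> copy it entirely from Postgres.
    exists = con.execute(
        "SELECT COUNT(*) FROM information_schema.tables WHERE table_name = ?", [table]
    ).fetchone()[0]
    if not exists:
        con.execute(f"CREATE TABLE {table} AS SELECT * FROM pg_db.{table}")
        return

    # Scenario 2: the table exists -> only import rows newer than what DuckDB already holds.
    max_id = con.execute(f"SELECT COALESCE(MAX(id), 0) FROM {table}").fetchone()[0]
    new_rows = con.execute(
        f"SELECT COUNT(*) FROM pg_db.{table} WHERE id > ?", [max_id]
    ).fetchone()[0]

    # Scenario 3: nothing new in Postgres -> exit quietly instead of failing the pipeline.
    if new_rows == 0:
        return

    con.execute(f"INSERT INTO {table} SELECT * FROM pg_db.{table} WHERE id > ?", [max_id])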
2. Setting up & running the dbt project
This step involves:
a. Cloning the dbt project from GitHub if it doesn't exist, or performing a git pull to fetch new features.
b. Checking out the "prod" branch.
c. Setting up the dbt profile (Figure 3.6).
d. Running dbt debug to verify the configuration.
e. Executing "dbt build --target prod" to build all models.
Note: To address permission issues in the Docker environment (Linux), specific git and Linux commands were used (e.g., git config --global --add safe.directory dbt_path).
The next step consists of using the "prod" branch of the project.
It is pretty basic: we just run a "git checkout prod".
It is also necessary to set up the dbt profile. I created it in a dedicated task, which is not strictly necessary: you can simply include it in the dbt project when you clone it.
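For illustration, a minimal dbt-duckdb profile pointing at the pipeline's database file could be written by the task like this; the profile name ebay_sales and the threads value are assumptions of mine, while the type, path and schema mirror the dbt debug output shown further down.

from pathlib import Path

# Minimal dbt-duckdb profile; only type/path/schema are taken from the debug output,
# the profile name and threads value are illustrative.
PROFILES_YML = """\
ebay_sales:
  target: prod
  outputs:
    prod:
      type: duckdb
      path: /opt/airflow/duckdb/ebay_sales_db.duckdb
      schema: main
      threads: 4
"""

def write_dbt_profile(profiles_dir: str = "/opt/airflow/dbt_project") -> None:
    # dbt picks up profiles.yml from --profiles-dir (or ~/.dbt by default).
    Path(profiles_dir, "profiles.yml").write_text(PROFILES_YML)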
Finally comes the time to build all the models. Running dbt debug first is mandatory to check that the dbt configuration (database connection + profile) is set up correctly.
Then we simply run a "dbt build --target prod" to build all the models.
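Put together, the task roughly chains the following commands. This is a hedged sketch: the repository URL is a placeholder, and whether the project uses subprocess calls or a Bash operator is an assumption.

import subprocess
from pathlib import Path

DBT_DIR = "/opt/airflow/dbt_project"  # volume mounted in docker-compose.yml
REPO_URL = "https://github.com/<user>/<dbt_project>.git"  # placeholder, not the real URL

def run(cmd, cwd=None):
    # Propagate a non-zero exit code so the Airflow task fails visibly.
    subprocess.run(cmd, cwd=cwd, check=True)

def setup_and_build_dbt():
    # Work around the Docker/Linux permission issue mentioned in the note above.
    run(["git", "config", "--global", "--add", "safe.directory", DBT_DIR])
    # a. Clone the project if it is absent, otherwise pull the latest changes.
    if not Path(DBT_DIR, ".git").exists():
        run(["git", "clone", REPO_URL, DBT_DIR])
    else:
        run(["git", "pull"], cwd=DBT_DIR)
    # b. Use the production branch.
    run(["git", "checkout", "prod"], cwd=DBT_DIR)
    # d. Verify the profile and the database connection before building.
    run(["dbt", "debug", "--profiles-dir", DBT_DIR], cwd=DBT_DIR)
    # e. Build all models (and run the tests) against the prod target.
    run(["dbt", "build", "--target", "prod", "--profiles-dir", DBT_DIR], cwd=DBT_DIR)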
Here are the Airflow logs of this task:
10:34:36 Configuration:
10:34:36 profiles.yml file [OK found and valid]
10:34:36 dbt_project.yml file [OK found and valid]
10:34:36 Required dependencies:
10:34:36 - git [OK found]
10:34:36 Connection:
10:34:36 database: ebay_sales_db
10:34:36 schema: main
10:34:36 path: /opt/***/duckdb/ebay_sales_db.duckdb
10:34:36 config_options: None
10:34:36 extensions: None
10:34:36 settings: None
10:34:36 external_root: .
10:34:36 use_credential_provider: None
10:34:36 attach: None
10:34:36 filesystems: None
10:34:36 remote: None
10:34:36 plugins: None
10:34:36 disable_transactions: False
10:34:36 Registered adapter: duckdb=1.8.1
10:34:36 Connection test: [OK connection ok]
10:34:36 All checks passed!
[2024-06-21, 10:41:10 UTC] {dbt_duckdb.py:212} INFO - DBT build output:
10:34:38 Running with dbt=1.8.2
10:34:38 Registered adapter: duckdb=1.8.1
10:34:58 Found 545 models, 2449 data tests, 3 sources, 793 macros
building the models....
.....
.....
10:41:08 Finished running 2449 data tests, 7 view models, 538 table models in 0 hours 6 minutes and 10.31 seconds (370.31s).
3. Updating tables in the Postgres database
To ensure the safe and accurate update of PostgreSQL tables, a comprehensive validation process is implemented. This process involves a series of tests to verify that the models created in DuckDB align with expectations. These tests encompass various aspects, including:
This rigorous testing phase effectively serves as a source-to-target mapping, ensuring that all data is properly prepared and validated before any updates are made to the PostgreSQL database.
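As an illustration only, such a source-to-target check could be as simple as comparing row counts and column layouts between a DuckDB model and its Postgres counterpart; the project's actual test suite is more comprehensive and is not shown here.

import duckdb

def validate_model(con: duckdb.DuckDBPyConnection, model: str) -> None:
    # The rebuilt model should never contain fewer rows than the table the app currently serves.
    duck_rows = con.execute(f"SELECT COUNT(*) FROM {model}").fetchone()[0]
    pg_rows = con.execute(f"SELECT COUNT(*) FROM pg_db.{model}").fetchone()[0]
    if duck_rows < pg_rows:
        raise ValueError(f"{model}: {duck_rows} rows in DuckDB vs {pg_rows} in Postgres")

    # Column names and order must match, otherwise the later insert would misalign values.
    duck_cols = [col[0] for col in con.execute(f"DESCRIBE {model}").fetchall()]
    pg_cols = [col[0] for col in con.execute(f"DESCRIBE pg_db.{model}").fetchall()]
    if duck_cols != pg_cols:
        raise ValueError(f"{model}: column mismatch between DuckDB and Postgres")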
Following successful validation, the data update process is initiated. This process follows a two-pronged approach:
This methodology ensures that the PostgreSQL database is always synchronized with the most current and accurate data, while maintaining data integrity throughout the entire process.
It is done as follows:
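As a hypothetical sketch, reusing the pg_db alias attached earlier, the refresh of one table could look like this; the delete-and-insert strategy and the transaction handling are assumptions of mine rather than the project's exact two-pronged approach.

import duckdb

def refresh_postgres_table(con: duckdb.DuckDBPyConnection, model: str) -> None:
    # Replace the contents of the Postgres table with the freshly built DuckDB model.
    # DuckDB allows writing to a single attached database within one transaction,
    # which keeps the swap atomic from the application's point of view.
    con.execute("BEGIN TRANSACTION")
    try:
        con.execute(f"DELETE FROM pg_db.{model}")
        con.execute(f"INSERT INTO pg_db.{model} SELECT * FROM {model}")
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")
        raise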
4. Advantages & Limitations
a) Advantages
This architecture brings several benefits to the application, ranging from the separation of transactional and analytical workloads to the automation of daily data refresh processes.
b) Limitations
5. Conclusion
This project demonstrates the potential of combining dbt and DuckDB for medium-sized data use cases. The architecture showcases an efficient, on-premises solution that balances performance, reliability, and automation.
While the current implementation is fully on-premises (with data file replication in GCP buckets for redundancy), there is potential for further enhancement through a cloud-native approach. This could involve migrating all data from Postgres to an object storage solution with an appropriate table format such as Apache Iceberg, which DuckDB can query efficiently. This evolution would effectively create a data lakehouse architecture.
In this enhanced scenario, all models would be materialized by dbt in the data lake, and the application would read from the cloud-based data lakehouse instead of the on-premises Postgres database. This approach would offer improved scalability, cost-effectiveness, and alignment with modern data engineering practices.
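To give an idea of what reading such a lakehouse would involve, DuckDB already ships an iceberg extension; a query against an Iceberg table in object storage might look like the sketch below (the bucket path is purely illustrative and credential setup is omitted).

import duckdb

con = duckdb.connect()
# The iceberg extension reads Iceberg metadata and data files; httpfs gives access to object storage.
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Illustrative path: in the scenario above, this would point at the bucket where dbt materializes the models.
rows = con.execute(
    "SELECT * FROM iceberg_scan('s3://my-lakehouse/sales/monthly_sales') LIMIT 10"
).fetchall()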
The project underscores the importance of choosing the right tools for specific data challenges and demonstrates how innovative combinations of technologies can yield significant benefits in data processing and analysis workflows.