End-to-End Data Pipeline with Snowflake, Airflow, and dbt

Repository: https://github.com/ntd284/dbt-airflow-snowflake

Overview

In today's data-driven landscape, businesses are shifting from traditional ETL (Extract, Transform, Load) processes to modern ELT (Extract, Load, Transform) workflows to leverage the full power of cloud data warehouses. By using Snowflake as the target database, I implemented a scalable and automated ELT pipeline orchestrated with Airflow and transformed with dbt.

This approach enables:

  • Faster data ingestion by loading raw data directly into the target database (Snowflake).
  • Flexible and modular transformations with dbt, ensuring clean, structured data for analytics.
  • Improved scalability and performance through modern cloud infrastructure.

Components Involved

  • Snowflake: Target database for storing raw, staging, and transformed data.
  • Airflow: Orchestrates workflows, schedules tasks, and automates data ingestion.
  • dbt: Performs modular, SQL-based transformations within Snowflake.
  • Python: Supports custom logic and integrates with Airflow for task execution.

The Challenge

Traditional ETL workflows transform data before loading, which can limit scalability and struggle with large or unstructured datasets. To overcome this, I implemented a modern ELT pipeline using dbt and Snowflake, allowing faster data ingestion, in-warehouse transformations, and improved scalability to meet growing data demands.

Main Tasks

  • Snowflake Setup: Create warehouse, database, schema, users, roles, and permissions on Snowflake Cloud.
  • dbt Setup: Install and configure dbt on the VM, connecting it to Snowflake.
  • Data Transformation: Write and execute dbt queries on Snowflake to transform raw data into staging, intermediate, and data marts with testing.
  • Airflow Orchestration: Airflow schedules and executes each step: staging → transformation → data mart → testing.

Project Workflow:

1. Setting Up the Snowflake Environment:

To begin the ELT process, you need a data platform to host and transform your data. In this project, we use Snowflake, an ideal solution for performing efficient ELT operations.

Start by creating a trial account with the Enterprise Edition, which provides the roles and features needed for this setup. You can choose your preferred cloud provider and region during setup.

The raw data will be ingested into Snowflake using Airflow. Before that, we need to configure the following:

  • Database: To store and organize data.
  • Warehouse: A compute cluster for executing operations (SQL queries, DML, or stored procedures).
  • Role: A specific role to manage and execute data transformations.

Note: In Snowflake, the term warehouse refers to computational resources for data operations, not traditional storage.

Run the following SQL queries in a new worksheet to set up the environment:
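
A minimal sketch of such a setup script is shown below; the object names (dbt_wh, dbt_db, dbt_schema, dbt_role) and the user placeholder are illustrative and should be adapted to your account:

    -- Placeholder names: dbt_wh, dbt_db, dbt_schema, dbt_role, <your_user>
    USE ROLE ACCOUNTADMIN;

    CREATE WAREHOUSE IF NOT EXISTS dbt_wh WITH WAREHOUSE_SIZE = 'XSMALL';
    CREATE DATABASE IF NOT EXISTS dbt_db;
    CREATE ROLE IF NOT EXISTS dbt_role;

    -- Allow the role to use the warehouse and own the database
    GRANT USAGE ON WAREHOUSE dbt_wh TO ROLE dbt_role;
    GRANT ALL ON DATABASE dbt_db TO ROLE dbt_role;
    GRANT ROLE dbt_role TO USER <your_user>;

    -- Create the schema that dbt will build into
    USE ROLE dbt_role;
    CREATE SCHEMA IF NOT EXISTS dbt_db.dbt_schema;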

2. Configuring the Connection Between dbt and Snowflake

  • Get the Snowflake Account:

Find your Snowflake account identifier via the URL (https://<account_identifier>.snowflakecomputing.com) from Admin → Accounts.

  • Initializing dbt for Snowflake:

Run dbt init to set up the project, configure Snowflake connection details (account, credentials, warehouse, database, schema, threads), and verify with dbt debug.
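
For example (the project name data_pipeline is a placeholder chosen during dbt init):

    dbt init data_pipeline    # answer the prompts with your Snowflake account, credentials, warehouse, database, schema, and threads
    dbt debug                 # confirms that dbt can connect to Snowflake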

  • Configure profiles.yml:

Edit the profiles.yml file to include your Snowflake connection details:
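
A sketch of the resulting profile, with placeholder values (the profile name data_pipeline must match the profile referenced in dbt_project.yml):

    data_pipeline:
      target: dev
      outputs:
        dev:
          type: snowflake
          account: <account_identifier>
          user: <snowflake_user>
          password: <snowflake_password>
          role: dbt_role
          warehouse: dbt_wh
          database: dbt_db
          schema: dbt_schema
          threads: 4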

  • Set Up Models in dbt_project.yml:

Define the project structure and configure the staging models to be materialized as views and the marts models as tables; both are built inside Snowflake:
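
For example, the models block of dbt_project.yml could be configured along these lines (project and warehouse names are placeholders):

    models:
      data_pipeline:
        staging:
          +materialized: view
          +snowflake_warehouse: dbt_wh
        marts:
          +materialized: table
          +snowflake_warehouse: dbt_wh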

Install dbt Packages:

  • Add the required packages to the packages.yml file (e.g., dbt-labs/dbt_utils), as sketched below.
  • Run dbt deps to install the packages.
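
A minimal packages.yml might look like this (the version shown is an example; pin one compatible with your dbt version):

    packages:
      - package: dbt-labs/dbt_utils
        version: 1.1.1

Running dbt deps then downloads the package into the dbt_packages folder.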

3. Transform Data from Raw to Staging:

The SQL queries live in the marts and staging folders of the dbt project on the VM, together with macros and tests; illustrative sketches follow the file list below.

  • macros:

- pricing.sql: A reusable SQL logic block, typically used to calculate or apply specific business rules like pricing adjustments across models.

  • models/marts:

- fct_orders.sql: Fact table model for orders, aggregating key metrics.

- generic_tests.yml: A YAML configuration defining generic tests (e.g., not null, unique).

- int_order_items_summary.sql: Intermediate model summarizing order items.

- int_order_items.sql: Intermediate model that processes raw order item data.

  • models/staging:

- stg_tpch_line_items.sql: Staging model for raw line item data, cleaning and organizing it for downstream use.

- stg_tpch_orders.sql: Staging model for raw order data, transforming it for further processing.

- tpch_sources.yml: YAML file defining data sources for the staging models, ensuring proper connection to raw data.

  • tests:

- fct_orders_date_valid.sql: Test script to validate that order dates meet expected criteria.

- fct_orders_discount.sql: Test script to check for correct discount calculations in orders.
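
As a rough illustration of what the staging layer contains (column names follow Snowflake's TPCH sample data set; the exact logic in the repository may differ), the source definition and a staging model could look roughly like this:

    # models/staging/tpch_sources.yml
    version: 2
    sources:
      - name: tpch
        database: snowflake_sample_data
        schema: tpch_sf1
        tables:
          - name: orders
          - name: lineitem

    -- models/staging/stg_tpch_orders.sql
    select
        o_orderkey as order_key,
        o_custkey as customer_key,
        o_orderstatus as status_code,
        o_totalprice as total_price,
        o_orderdate as order_date
    from {{ source('tpch', 'orders') }}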

The queries are executed on Snowflake with:

dbt run         
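
The generic tests in generic_tests.yml and the singular tests in the tests folder are then executed with:

    dbt test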

Transformation from Staging to Data Mart:

4. Integrate with Airflow via astronomer.io:

Astro is a modern platform for managing and deploying Apache Airflow workflows, providing tools like the Astro CLI to simplify development, orchestration, and monitoring of data pipelines.

After installing the Astro CLI successfully:

Initialize the Astro project:

astro dev init        

Start Astro to launch Airflow locally:

astro dev start        

Copy the dbt_pipeline DAG into the dags folder:
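
As a minimal sketch of what a dbt_pipeline DAG can look like (the repository may instead use the astronomer-cosmos integration; the project path and names are placeholders):

    # dags/dbt_pipeline.py -- minimal sketch; assumes dbt is available inside the Airflow environment
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    DBT_PROJECT_DIR = "/usr/local/airflow/dbt/data_pipeline"  # placeholder path

    with DAG(
        dag_id="dbt_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # Build the staging, intermediate, and mart models in Snowflake
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command=f"cd {DBT_PROJECT_DIR} && dbt run",
        )

        # Run the generic and singular tests once the models are built
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command=f"cd {DBT_PROJECT_DIR} && dbt test",
        )

        dbt_run >> dbt_test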

After logging in successfully at localhost:8080 with the default account admin and password admin, we configure the connection between Snowflake and Airflow.
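
This can be done in the Airflow UI (Admin → Connections) or with the Airflow CLI; a sketch with placeholder values is shown below (the connection id snowflake_conn and the extra field names may vary with the Snowflake provider version in use):

    airflow connections add 'snowflake_conn' \
        --conn-type 'snowflake' \
        --conn-login 'dbt_user' \
        --conn-password '<password>' \
        --conn-schema 'dbt_schema' \
        --conn-extra '{"account": "<account_identifier>", "warehouse": "dbt_wh", "database": "dbt_db", "role": "dbt_role"}'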

Finally, we have the complete flow of the transformation process.

Conclusion:

By leveraging Snowflake as the target database, dbt for modular SQL-based transformations, and Airflow (via Astronomer) for orchestration, I successfully implemented a modern and scalable ELT pipeline. This solution enables:

  • Efficient Data Transformation: Raw data is ingested, staged, and transformed into data marts for analytics.
  • Automation and Orchestration: Airflow automates the end-to-end pipeline, ensuring smooth task execution.
  • Reusability and Scalability: dbt's modular structure and Snowflake's cloud infrastructure ensure flexibility, performance, and scalability for growing data demands.

This pipeline delivers clean, structured, and reliable data for analytics, empowering businesses to make data-driven decisions faster and more efficiently.

