End-to-End Data Pipeline with Snowflake, Airflow, and dbt

Repository: https://github.com/ntd284/dbt-airflow-snowflake

Overview

In today's data-driven landscape, businesses are shifting from traditional ETL (Extract, Transform, Load) processes to modern ELT (Extract, Load, Transform) workflows to leverage the full power of cloud data warehouses. By using Snowflake as the target database, I implemented a scalable and automated ELT pipeline orchestrated with Airflow and transformed with dbt.

This approach enables:

  • Faster data ingestion by loading raw data directly into the target database (Snowflake).
  • Flexible and modular transformations with dbt, ensuring clean, structured data for analytics.
  • Improved scalability and performance through modern cloud infrastructure.

Components Involved

  • Snowflake: Target database for storing raw, staging, and transformed data.
  • Airflow: Orchestrates workflows, schedules tasks, and automates data ingestion.
  • dbt: Performs modular, SQL-based transformations within Snowflake.
  • Python: Supports custom logic and integrates with Airflow for task execution.

The Challenge

Traditional ETL workflows transform data before loading, which can limit scalability and struggle with large or unstructured datasets. To overcome this, I implemented a modern ELT pipeline using dbt and Snowflake, allowing faster data ingestion, in-warehouse transformations, and improved scalability to meet growing data demands.

Main Tasks

  • Snowflake Setup: Create warehouse, database, schema, users, roles, and permissions on Snowflake Cloud.
  • dbt Setup: Install and configure dbt on the VM, connecting it to Snowflake.
  • Data Transformation: Write and execute dbt queries on Snowflake to transform raw data into staging, intermediate, and data marts with testing.
  • Airflow Orchestration: Airflow schedules and executes each step: staging → transformation → data mart → testing.

Project Workflow:

1. Setting Up the Snowflake Environment:

To begin the ELT process, you need a data platform to host and transform your data. In this project, we use Snowflake, an ideal solution for performing efficient ELT operations.

Start by creating a trial account with the Enterprise Edition, which provides the roles and features needed for this setup. You can choose your preferred cloud provider and region during setup.

The raw data will be ingested into Snowflake using Airflow. Before that, we need to configure the following:

  • Database: To store and organize data.
  • Warehouse: A compute cluster for executing operations (SQL queries, DML, or stored procedures).
  • Role: A specific role to manage and execute data transformations.

Note: In Snowflake, the term warehouse refers to computational resources for data operations, not traditional storage.

Run the following SQL queries in a new worksheet to set up the environment:
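
A minimal sketch of such a setup script is shown below; the object names (dbt_wh, dbt_db, dbt_schema, dbt_role) and the user placeholder are illustrative and should be adapted to your account:

    -- Placeholder names: dbt_wh, dbt_db, dbt_schema, dbt_role, <your_user>
    USE ROLE ACCOUNTADMIN;

    CREATE WAREHOUSE IF NOT EXISTS dbt_wh WITH WAREHOUSE_SIZE = 'XSMALL';
    CREATE DATABASE IF NOT EXISTS dbt_db;
    CREATE ROLE IF NOT EXISTS dbt_role;

    -- Allow the role to use the warehouse and own the database
    GRANT USAGE ON WAREHOUSE dbt_wh TO ROLE dbt_role;
    GRANT ALL ON DATABASE dbt_db TO ROLE dbt_role;
    GRANT ROLE dbt_role TO USER <your_user>;

    -- Create the schema that dbt will build into
    USE ROLE dbt_role;
    CREATE SCHEMA IF NOT EXISTS dbt_db.dbt_schema;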

2. Configuring the Connection Between dbt and Snowflake

  • Get the Snowflake Account:

Find your Snowflake account identifier via the URL (https://<account_identifier>.snowflakecomputing.com) from Admin → Accounts.

  • Initializing dbt for Snowflake:

Run dbt init to set up the project, configure Snowflake connection details (account, credentials, warehouse, database, schema, threads), and verify with dbt debug.
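
For example (the project name data_pipeline is a placeholder chosen during dbt init):

    dbt init data_pipeline    # answer the prompts with your Snowflake account, credentials, warehouse, database, schema, and threads
    dbt debug                 # confirms that dbt can connect to Snowflake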

  • Configure profiles.yml:

Edit the profiles.yml file to include your Snowflake connection details:
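
A sketch of the resulting profile, with placeholder values (the profile name data_pipeline must match the profile referenced in dbt_project.yml):

    data_pipeline:
      target: dev
      outputs:
        dev:
          type: snowflake
          account: <account_identifier>
          user: <snowflake_user>
          password: <snowflake_password>
          role: dbt_role
          warehouse: dbt_wh
          database: dbt_db
          schema: dbt_schema
          threads: 4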

  • Set Up Models in dbt_project.yml:

Define the project structure and configure the staging models to be materialized as views and the marts models as tables; both are built inside Snowflake:
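
For example, the models block of dbt_project.yml could be configured along these lines (project and warehouse names are placeholders):

    models:
      data_pipeline:
        staging:
          +materialized: view
          +snowflake_warehouse: dbt_wh
        marts:
          +materialized: table
          +snowflake_warehouse: dbt_wh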

Install dbt Packages:

  • Add the required packages to the packages.yml file (e.g., dbt-labs/dbt_utils), as sketched below.
  • Run dbt deps to install the packages.
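
A minimal packages.yml might look like this (the version shown is an example; pin one compatible with your dbt version):

    packages:
      - package: dbt-labs/dbt_utils
        version: 1.1.1

Running dbt deps then downloads the package into the dbt_packages folder.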

3. Transform Data from Raw to Staging:

The SQL queries live in the marts and staging folders of the dbt project on the VM, together with macros and tests; illustrative sketches follow the file list below.

  • macros:

- pricing.sql: A reusable SQL logic block, typically used to calculate or apply specific business rules like pricing adjustments across models.

  • models/marts:

- fct_orders.sql: Fact table model for orders, aggregating key metrics.

- generic_tests.yml: A YAML configuration defining generic tests (e.g., not null, unique).

- int_order_items_summary.sql: Intermediate model summarizing order items.

- int_order_items.sql: Intermediate model that processes raw order item data.

  • models/staging:

- stg_tpch_line_items.sql: Staging model for raw line item data, cleaning and organizing it for downstream use.

- stg_tpch_orders.sql: Staging model for raw order data, transforming it for further processing.

- tpch_sources.yml: YAML file defining data sources for the staging models, ensuring proper connection to raw data.

  • tests:

- fct_orders_date_valid.sql: Test script to validate that order dates meet expected criteria.

- fct_orders_discount.sql: Test script to check for correct discount calculations in orders.
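
As a rough illustration of what the staging layer contains (column names follow Snowflake's TPCH sample data set; the exact logic in the repository may differ), the source definition and a staging model could look roughly like this:

    # models/staging/tpch_sources.yml
    version: 2
    sources:
      - name: tpch
        database: snowflake_sample_data
        schema: tpch_sf1
        tables:
          - name: orders
          - name: lineitem

    -- models/staging/stg_tpch_orders.sql
    select
        o_orderkey as order_key,
        o_custkey as customer_key,
        o_orderstatus as status_code,
        o_totalprice as total_price,
        o_orderdate as order_date
    from {{ source('tpch', 'orders') }}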

The queries are executed on Snowflake with:

dbt run         
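
The generic tests in generic_tests.yml and the singular tests in the tests folder are then executed with:

    dbt test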

Transformation from Staging to Data Mart:

4. Integrate with Airflow via astronomer.io:

Astro is a modern platform for managing and deploying Apache Airflow workflows, providing tools like the Astro CLI to simplify development, orchestration, and monitoring of data pipelines.

After installing the Astro CLI successfully:

Initialize the Astro project:

astro dev init        

Start Astro to launch Airflow locally:

astro dev start        

Copy the dbt_pipeline DAG into the dags folder:
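
As a minimal sketch of what a dbt_pipeline DAG can look like (the repository may instead use the astronomer-cosmos integration; the project path and names are placeholders):

    # dags/dbt_pipeline.py -- minimal sketch; assumes dbt is available inside the Airflow environment
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    DBT_PROJECT_DIR = "/usr/local/airflow/dbt/data_pipeline"  # placeholder path

    with DAG(
        dag_id="dbt_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # Build the staging, intermediate, and mart models in Snowflake
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command=f"cd {DBT_PROJECT_DIR} && dbt run",
        )

        # Run the generic and singular tests once the models are built
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command=f"cd {DBT_PROJECT_DIR} && dbt test",
        )

        dbt_run >> dbt_test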

After logging in successfully at localhost:8080 with the default account admin and password admin, we configure the connection between Snowflake and Airflow.
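
This can be done in the Airflow UI (Admin → Connections) or with the Airflow CLI; a sketch with placeholder values is shown below (the connection id snowflake_conn and the extra field names may vary with the Snowflake provider version in use):

    airflow connections add 'snowflake_conn' \
        --conn-type 'snowflake' \
        --conn-login 'dbt_user' \
        --conn-password '<password>' \
        --conn-schema 'dbt_schema' \
        --conn-extra '{"account": "<account_identifier>", "warehouse": "dbt_wh", "database": "dbt_db", "role": "dbt_role"}'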

Finally, we have the complete flow of the transformation process.

Conclusion:

By leveraging Snowflake as the target database, dbt for modular SQL-based transformations, and Airflow (via Astronomer) for orchestration, I successfully implemented a modern and scalable ELT pipeline. This solution enables:

  • Efficient Data Transformation: Raw data is ingested, staged, and transformed into data marts for analytics.
  • Automation and Orchestration: Airflow automates the end-to-end pipeline, ensuring smooth task execution.
  • Reusability and Scalability: dbt's modular structure and Snowflake's cloud infrastructure ensure flexibility, performance, and scalability for growing data demands.

This pipeline delivers clean, structured, and reliable data for analytics, empowering businesses to make data-driven decisions faster and more efficiently.

