Data Build Tool


The Evolution of Data Warehousing: ETL vs. ELT

In the early days of data warehousing, storage and compute were expensive, so processing large raw datasets directly in the warehouse wasn't feasible. This led to the development of Extract, Transform, Load (ETL), a three-step process:

  1. Extract: Data is pulled from various sources.
  2. Transform: The data is cleaned, formatted, and transformed to meet the specific needs of the data warehouse. This transformation typically occurs in a separate staging area.
  3. Load: The transformed data is loaded into the data warehouse for analysis.

However, with the rise of cloud storage and advancements in processing power, storage and computation costs became less of a concern. This paved the way for Extract, Load, Transform (ELT), a more agile approach:

  1. Extract: Data is pulled from various sources.
  2. Load: The raw data is loaded directly into the data warehouse.
  3. Transform: The data is transformed within the warehouse itself, leveraging its processing capabilities (see the sketch after this list).
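
To make the in-warehouse transform step concrete, here is a minimal SQL sketch; the raw.orders and analytics.clean_orders names are hypothetical.

-- ELT: the raw data is already loaded, so the transform is just SQL
-- executed by the warehouse engine itself.
CREATE TABLE analytics.clean_orders AS
SELECT
    order_id,
    LOWER(TRIM(status))      AS status,      -- normalize inconsistent casing
    CAST(order_date AS DATE) AS order_date   -- enforce a proper date type
FROM raw.orders
WHERE order_id IS NOT NULL;                  -- drop unusable rows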

Key Differences:

  • Transformation Location: ETL transforms data in a staging area before loading it, while ELT transforms data within the warehouse after loading.
  • Data Storage: ETL typically doesn't store raw data in the warehouse, while ELT can store it for future exploration.
  • Flexibility: ELT offers more flexibility for working with diverse and evolving data sources.

Tools for ELT:

  • Data Extraction: Tools like Fivetran or Stitch can automate data extraction from various sources.
  • Data Transformation: dbt (data build tool) is a popular option for transforming data within the warehouse.

Choosing Between ETL and ELT:

The best approach depends on your specific needs. Here are some factors to consider:

  • Data complexity: ETL is better suited for complex data transformations.
  • Data size: ELT can handle larger datasets more efficiently.
  • Data governance: ETL offers better control over data quality and security.
  • Analytics needs: ELT provides more flexibility for exploratory analysis.



Slowly Changing Dimensions

Slowly Changing Dimensions (SCDs) are a way to handle data in a data warehouse that changes over time.

There are different approaches to SCD, but generally, it involves keeping a history of changes to the data, instead of simply overwriting it with the latest information. This allows you to analyze the data as it existed at any point in time.


SCD Type 0: Not updating the DWH when a dimension changes. - The data change is reflected only in the source system. - The change is skipped, so the DWH table keeps the original value.



SCD Type 1: Updating the DWH table when a dimension changes, overwriting the original data. - This is used when the historical data is not relevant or has become obsolete. A sketch follows below.
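
A minimal Type 1 sketch in SQL, assuming a hypothetical dim_customer table: the new value simply replaces the old one, and the history is gone.

-- SCD Type 1: overwrite in place; the previous address is lost.
UPDATE dim_customer
SET address = '221B Baker Street'
WHERE customer_id = 42;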



SCD Type 2: Keeping full history. - An additional (historical) row is added for each dimension change. - This type is used when historical data is important and must be saved after each change. - All historical data remains recoverable, because the extra records are retained. - Pro: all historical data is accessible. - Con: requires more storage for the additional rows. - Using SCD Type 2 in cases where a lot of changes need to be tracked can result in issues with storage and processing speed. A sketch follows below.
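
A minimal Type 2 sketch in SQL, again assuming a hypothetical dim_customer table: each change appends a new row, and validity dates make point-in-time queries possible.

-- SCD Type 2: one row per version, bounded by validity dates.
CREATE TABLE dim_customer (
    customer_key INT,          -- surrogate key, unique per version
    customer_id  INT,          -- business key, shared across versions
    address      VARCHAR(200),
    valid_from   DATE,
    valid_to     DATE          -- NULL marks the current version
);

-- Point-in-time lookup: each customer's address on 2023-06-30.
SELECT customer_id, address
FROM dim_customer
WHERE valid_from <= DATE '2023-06-30'
  AND (valid_to IS NULL OR valid_to > DATE '2023-06-30');

Incidentally, dbt snapshots implement exactly this Type 2 pattern.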


SCD Type 3: Keeping limited history. - Separate columns are added for the previous and current values. - Only the relevant historical and change data is captured, in those separate columns.

- This is used when many dimension changes occur and need to be captured with efficient processing speed. - The trade-off is that not all history is captured: only the current value and the immediately preceding one are reflected. A sketch follows below.
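
A minimal Type 3 sketch, with hypothetical names: the previous value lives in its own column, so exactly one prior state is retained.

-- SCD Type 3: limited history via paired current/previous columns.
CREATE TABLE dim_customer_t3 (
    customer_id        INT,
    current_address    VARCHAR(200),
    previous_address   VARCHAR(200),  -- only the immediately prior value
    address_changed_on DATE
);

-- On a change, shift current into previous, then overwrite current
-- (the SET expressions read the pre-update values).
UPDATE dim_customer_t3
SET previous_address   = current_address,
    current_address    = '742 Evergreen Terrace',
    address_changed_on = CURRENT_DATE
WHERE customer_id = 42;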


What is dbt?

dbt, or Data Build Tool, is the 'T' (transform) part of the ELT process.

dbt, or data build tool, is like a recipe book for your data warehouse. Instead of writing complex code from scratch, you use dbt to write clear instructions (recipes) on how to transform your data.

Here's where it's used:

Imagine you have a bunch of raw ingredients (data) in your warehouse. dbt helps you follow recipes (transformations) to turn those ingredients into delicious dishes (analyzed data) for your business insights.


dbt Overview

Imagine a data warehouse as a dark room. You need to see what's inside, but a simple SQL statement (like a candle) is limited. It provides a temporary glimpse for one person, but lacks reusability and collaboration. This is where dbt comes in as your powerful spotlight.


dbt goes beyond basic SQL by offering:

  • Reusability: Modular transformations written in SQL can be easily reused across projects, saving time and effort.
  • Collaboration: dbt fosters teamwork. Everyone can see and understand the data transformations, ensuring consistency and reducing errors.
  • Lineage and Documentation: dbt automatically tracks dependencies and generates clear documentation, making it easy to understand how data flows through the system (see the sketch after this list).
  • Testing: dbt enables automated testing to guarantee data quality and catch issues before they impact downstream users.
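
As a small illustration of how that lineage arises, here is a sketch: referencing one model from another with dbt's ref() function is what lets dbt build its dependency graph. The model names are hypothetical.

-- models/fct_completed_orders.sql
-- ref() resolves to the stg_orders model and records the dependency,
-- which powers dbt's lineage graph and documentation.
SELECT *
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'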


dbt empowers you to:

  • Shine a light on every corner of your data warehouse. Gain a comprehensive understanding of your data with clear visibility into transformations.
  • Work together effectively. Everyone on your team can collaborate and share knowledge about the data.
  • Ensure data quality. Automated testing helps maintain reliable data for accurate analysis.


Using dbt is like flipping on the lights in the data warehouse. It's a revelation you won't regret.

Technically, dbt lets you build production-grade pipelines using the software engineering best practices mentioned above.


A General dbt Architecture


Installation and Setup

Python and Virtualenv setup, and dbt installation - Windows

Python

This is the Python installer you want to use:

https://www.python.org/downloads/release/python-3113/


Please make sure that you work with Python 3.11, as newer versions of Python might not be compatible with some of the dbt packages.


Virtualenv setup

Here are the commands I executed, which should work pretty well for you as well (if virtualenv isn't installed yet, run pip install virtualenv first):

cd Desktop

mkdir <name of your directory>

cd <name of the directory>

virtualenv venv

venv\Scripts\activate        


Virtualenv setup and dbt installation - Mac


iTerm2

I suggest you use iTerm2 instead of the built-in Terminal application.

https://iterm2.com/


Homebrew

Homebrew is a widely popular package manager for the Mac. This is what we use in the class for installing virtualenv.

https://brew.sh/


dbt installation

You can install any dbt adapter you want; I chose Snowflake because that's what I'm learning with, and it offers a 30-day free trial (or $400 of credits, whichever runs out first).


mkdir course

cd course

virtualenv venv

. venv/bin/activate

pip install dbt-snowflake==1.7.1

which dbt        


dbt setup

Initialize the dbt profiles folder on Mac/Linux:

mkdir ~/.dbt        


Initialize the dbt profiles folder on Windows:

mkdir %userprofile%\.dbt        


Create a dbt project (all platforms):

dbt init dbtlearn        



GO AHEAD AND BREAK A LEG!!!! (If you didn't get it… it's a way of saying good luck indirectly.)


MODELS in dbt are SQL statements, and the usage of CTEs (Common Table Expressions) is a common practice.



CTEs are fundamentally simple and easy to learn and implement.

  • They help us write readable and maintainable code quickly.
  • By definition, they are temporary, named result sets.
  • In reality, CTEs are great because the result stays in memory during the execution of the SELECT, INSERT, UPDATE, DELETE, or MERGE statement that references it.


Here is a template you can use to write any CTE without having to remember the syntax; make sure you register the points below in your memory (pun intended). A concrete sketch follows after the list.

  • The WITH clause tells the database that we are creating a Common Table Expression.
  • Then we give our CTE a name, <name_of_the_result_set>, so we can reference it later.
  • The ([column names]) part is completely optional; it sets aliases for the columns that come out of the CTE.
  • The <cte_query> is a SELECT statement. In reality it produces a temporary table that is referenced from the FROM clause, like any other everyday table.
  • The <reference_the_CTE> part is where we reference and execute the CTE.
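
Here is a minimal, runnable sketch of that template; the raw_orders table and its columns are hypothetical.

-- WITH <name_of_the_result_set> ([column names]) AS (<cte_query>) <reference_the_CTE>
WITH monthly_orders (order_month, order_count) AS (    -- name + optional column aliases
    SELECT DATE_TRUNC('month', order_date), COUNT(*)   -- the <cte_query>: any SELECT
    FROM raw_orders
    GROUP BY DATE_TRUNC('month', order_date)
)
SELECT order_month, order_count                        -- <reference_the_CTE>
FROM monthly_orders
ORDER BY order_month;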


GENERAL TAKEAWAYS:

  1. Although they perform almost the same function as a view, CTEs do not store their definition in metadata.
  2. CTEs are preferred because they are extremely readable, easy to maintain, and make complex queries understandable.
  3. CTEs can be used in stored procedures, functions, triggers, and even views.


You can start building your models by adding SQL files to the models folder of your dbt project, as in the sketch below.
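
For example, a file such as models/stg_orders.sql (the file name and the raw source below are hypothetical) becomes a model; dbt materializes the final SELECT of the file as a view or table in your warehouse.

-- models/stg_orders.sql
WITH source AS (
    SELECT * FROM raw.jaffle_shop.orders   -- hypothetical raw table loaded by your EL tool
),

renamed AS (
    SELECT
        id      AS order_id,
        user_id AS customer_id,
        order_date,
        status
    FROM source
)

SELECT * FROM renamed

Running dbt run then compiles and executes this SQL, creating the stg_orders model in your warehouse.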



