Data Build Tool
Divine Sam
Data Engineer @NxtWave | Python | SQL | Data Analysis | Spark | Databricks | GCP | DBT | BigQuery | Looker
The Evolution of Data Warehousing: ETL vs. ELT
In the early days of data warehousing, storage costs were a major bottleneck. Processing large datasets directly in the warehouse wasn't feasible. This led to the development of Extract, Transform, Load (ETL), a three-step process:
However, with the rise of cloud storage and advancements in processing power, storage and computation costs became less of a concern. This paved the way for Extract, Load, Transform (ELT), a more agile approach:
Key Differences:
Tools for ELT:
Choosing Between ETL and ELT:
The best approach depends on your specific needs. Here are some factors to consider:
Slowly Changing Dimensions
Slowly Changing Dimensions (SCDs) are a way to handle data in a data warehouse that changes over time.
There are different approaches to SCD, but generally, it involves keeping a history of changes to the data, instead of simply overwriting it with the latest information. This allows you to analyze the data as it existed at any point in time.
SCD Type 0 : Not Updating DWH when a Dimension Changes. - Data change is only reflect din the source - Skips facts data in DWH table
SCD Type 1 : Updating the DWH table when a dimension changes, overwriting the original data - This is used when the Historical Data is not relevant or has become obsolete.
SCD Type 2: Keeping Full History - Adding additional (historical data) rows for each dimension change. - This type is used when historical data is of importance s saved after each change - All historical data remains recoverable - Additional records are retained for historical data. - Pro: All historical data is accessible. ? ? Con: Requires more storage for additional data - Using SCD type 2 in cases where a lot of changes need to be kept track of, can result in issues with storage and processing? speed. ??
SCD Type 3: Keeping limited History - Adding separate columns for original and current value ?? ? ? - Only relevant historical and change data is captured in separate columns
? ? ? - When a lot of dimension change occurs and need to be captured with efficient processing speed. ? ? ? - But not all historical is captured only the immediate present data is captured and reflected.
What is dbt?
dbt or Data Build Tool is the ’T’ or transform part of ELT process.
dbt, or data build tool, is like a recipe book for your data warehouse. Instead of writing complex code from scratch, you use dbt to write clear instructions (recipes) on how to transform your data.
Here's where it's used:
Imagine you have a bunch of raw ingredients (data) in your warehouse. dbt helps you follow recipes (transformations) to turn those ingredients into delicious dishes (analyzed data) for your business insights.
dbt Overview
Imagine a data warehouse as a dark room. You need to see what's inside, but a simple SQL statement (like a candle) is limited. It provides a temporary glimpse for one person, but lacks reusability and collaboration. This is where dbt comes in as your powerful spotlight.
dbt goes beyond basic SQL by offering:
dbt empowers you to:
Using dbt is like flipping on the lights in the data warehouse. It's a revelation you won't regret.
Technically, dbt enables to make production grade pipeline with software engineering best practices mentioned above.
Installation and Setup
Python and Virtualenv setup, and dbt installation - Windows
Python
This is the Python installer you want to use:?
领英推荐
[https://www.python.org/ftp/python/3.10.7/python-3.10.7-amd64.exe ](https://www.python.org/downloads/release/python-3113/)
Please make sure that you work with Python 3.11 as newer versions of python might not be compatible with some of the dbt packages.
Virtualenv setup
Here are the commands I executed, which should work pretty well for you as well:
cd Desktop
mkdir <name of your directory>
cd <name of the directory>
virtualenv venv
venv\Scripts\activate
Virtualenv setup and dbt installation - Mac
iTerm2
I suggest you to use iTerm2 instead of the built-in Terminal application.
Homebrew
Homebrew is a widely popular application manager for the Mac. This is what we use in the class for installing a virtualenv.
dbt installation ## you can install any dbt extension you want , I just chose snowflake because that’s what I’m learning with and it has a free trial for 30 days or 400$
create course
cd course
virtualenv venv
. venv/bin/activate
pip install dbt-snowflake==1.7.1
which dbt
dbt setup
Initialize the dbt profiles folder on Mac/Linux:
mkdir ~/.dbt
Initialize the dbt profiles folder on Windows:
mkdir %userprofile%\.dbt
Create a dbt project (all platforms):
dbt init dbtlearn
GO AHEAD AND BREAK A LEG!!!! (If you didn’t get it….it’s a way of saying good luck indirectly.)
CTEs are fundamentally simple and easy to learn and implement.
The template which you can use to write any CTE without a remembering the Syntax make sure you register the below in your memory (pun intended)
GENERAL TAKEAWAYS:
You can start building your models in the models folder in your dbt project folder by executing SQL statements.