Slowly Changing Dimensions in Data Warehouses
Krishna Yogi Kolluru
Data Science Architect | ML | GenAI | Speaker | ex-Microsoft | ex- Credit Suisse | IIT - NUS Alumni | AWS & Databricks Certified Data Engineer | T2 Skilled worker
What is a Data Warehouse?
A data warehouse is a centralized repository where data from different sources is stored. It allows organizations to analyze large volumes of data to make informed decisions. In business intelligence, data warehouses play a crucial role as they provide the data needed for reporting and analysis.
Understanding Dimensions in Data Warehouses
In a data warehouse, data is often categorized into facts and dimensions. Facts are the measurable events or transactions, while dimensions are descriptive attributes related to the facts.
For example, in a sales database, facts could be the number of products sold, and dimensions could be attributes like product name, customer details, and time of sale.
Slowly Changing Dimensions (SCDs)
Slowly Changing Dimensions (SCDs) are dimensions in which data changes slowly over time. For instance, consider a customer’s address. A customer might move to a new address, and this change needs to be captured in the data warehouse without losing the historical data. SCDs are critical because they allow data warehouses to maintain accurate historical records while reflecting current data.
Types of Slowly Changing Dimensions (SCDs)
SCD Type 0: Retain Original
In SCD Type 0, no changes are made to the existing data; the original data remains the same. This type is rarely used because it doesn’t account for any changes.
Example: If a customer’s address is recorded as “123 Main St” and the customer moves, in Type 0, the address remains “123 Main St” even after the customer moves. This might be used in scenarios where historical accuracy is paramount, and changes are either very rare or irrelevant to the analysis.
SCD Type 1: Overwrite
In SCD Type 1, the existing data is simply overwritten with the new data. This approach doesn’t keep track of the historical data.
Example: If a customer’s address changes from “123 Main St” to “456 Elm St”, in Type 1, the data warehouse updates the record to “456 Elm St”. The old address “123 Main St” is lost. This method is suitable when the historical data is not important, such as correcting errors.
Real-world Use Case: Updating the spelling of a customer’s name or correcting a typo where historical data is not necessary.
SCD Type 2: Add New Row
In SCD Type 2, a new row is added to the table with the updated data, and the old data is retained. This type keeps a history of changes.
Example: If a customer’s address changes from “123 Main St” to “456 Elm St”, in Type 2, the data warehouse adds a new row for “456 Elm St” with an effective date and marks the previous row as outdated or closed. This method is ideal for tracking historical data, such as monitoring the location history of a customer.
Scenario: A retail company tracks customer addresses for delivery purposes. If a customer moves, the old address needs to be retained for historical delivery records, while the new address is used for future deliveries.
SCD Type 3: Add New Attribute
In SCD Type 3, a new attribute or column is added to store the new data while retaining the old data in another column.
Example: If a customer’s address changes, an additional column like “Previous Address” is created. The old address “123 Main St” is moved to this column, and the new address “456 Elm St” is placed in the main “Address” column. This method is used when only the previous value needs to be tracked, not the full history.
Use Case: Tracking changes in an employee’s role within an organization where only the current and immediate previous roles are of interest.
领英推荐
SCD Type 4: Add Mini-Dimension
In SCD Type 4, a separate mini-dimension is created to store historical data. This mini-dimension contains only the attributes that change frequently.
Example: A mini-dimension could be created for a customer’s contact information. When a customer’s phone number changes, the new phone number is added to the mini-dimension, and the original dimension only stores a reference to this mini-dimension.
Scenario: An airline company tracks changes in customer preferences (e.g., meal choice, seating preference). Instead of updating the main dimension, a mini-dimension is created to handle these frequent changes.
SCD Type 6: Hybrid Approach
SCD Type 6 combines features of Types 1, 2, and 3 to capture both the current value and the history. It maintains the current value, the historical values, and an indicator of the most recent change.
Example: For a customer’s address change, the data warehouse would store the current address, a historical record with effective dates, and a previous address column.
Real-world Scenario: In a financial institution, tracking account holder details where both the current and historical information is important for regulatory compliance and customer service.
How to Choose the Right SCD Type?
Factors to Consider
Common Pitfalls and How to Avoid Them
Implementing SCDs in Popular ETL Tools
SCD Implementation in SQL
You can implement SCDs in SQL using INSERT, UPDATE, and DELETE statements. For example, to implement SCD Type 2:
-- Assuming you have a customer table
UPDATE customer
SET end_date = CURRENT_DATE
WHERE customer_id = @customer_id
AND end_date IS NULL;
INSERT INTO customer (customer_id, address, start_date, end_date)
VALUES (@customer_id, @new_address, CURRENT_DATE, NULL);
Using ETL Tools like Informatica, SSIS, and Talend
Summary
Follow more data engineering insights!