Dimension Tables And Change
Designing dimension tables for data warehouses is one of the most interesting challenges for a data modeler.
Assigning the attributes to the dimension is relatively simple - you know the grain of the data and you know what attributes are available at that grain.
Next is the decision about SCD (Slowly Changing Dimensions). How do we want to handle change for the dimension. Is the data static or do we need to manage change?
There are 6 SCD types.
0 - static data (ignore any changes)
1 - overwrite data (no history is retained)
2 - create new record (history is retained)
3 - history is a new column
4 - a new dimension is added
6 - combination of type 2 and type 3
The most common types used are type 1 and type 2. I have used type 4 a few times, but it was only when it was not performance intensive - I used a trigger to populate the history table. In 40 years I only used type 3 once.
Now that we have the fields it is important to look at the rate of change at the attribute level. If some fields change often and some fields rarely change we may consider switching them into a fast and a slow table. We don't want to be carrying 20 fields where one field changes often and 19 fields rarely change. Then each record will have identical values for 19 fields, swelling our database.
Most people use a simple CDC (change data capture) strategy to detect change. We can then follow the logic required to update the data. A common technique is to concatenate the fields of interest with a separator (like a pipe column) and then Hash8 that value. What you get is the knowledge that at least one field changed value, but you get no visibility to which field.
Some modelers will lean towards 6NF data structures. This provides the ability to track change at the attribute level. But, it also produces significant complexity for loading the data and it is painful for any SQL Developer. As much as I love 6NF, I would not recommend it as a first choice. The complexity outweighs most of the benefits.
For SCD type 2, you will use two dates; start date and end date. You could use NULL for the end date for the most recent record, but a lot of people will use a date far into the future (12/31/2099). It makes SQL queries easier.