Dimension Tables And Change

Dimension Tables And Change

Designing dimension tables for data warehouses is one of the most interesting challenges for a data modeler.

Assigning the attributes to the dimension is relatively simple - you know the grain of the data and you know what attributes are available at that grain.

Next is the decision about SCD (Slowly Changing Dimensions). How do we want to handle change for the dimension. Is the data static or do we need to manage change?

There are 6 SCD types.

0 - static data (ignore any changes)

1 - overwrite data (no history is retained)

2 - create new record (history is retained)

3 - history is a new column

4 - a new dimension is added

6 - combination of type 2 and type 3

The most common types used are type 1 and type 2. I have used type 4 a few times, but it was only when it was not performance intensive - I used a trigger to populate the history table. In 40 years I only used type 3 once.

Now that we have the fields it is important to look at the rate of change at the attribute level. If some fields change often and some fields rarely change we may consider switching them into a fast and a slow table. We don't want to be carrying 20 fields where one field changes often and 19 fields rarely change. Then each record will have identical values for 19 fields, swelling our database.

Most people use a simple CDC (change data capture) strategy to detect change. We can then follow the logic required to update the data. A common technique is to concatenate the fields of interest with a separator (like a pipe column) and then Hash8 that value. What you get is the knowledge that at least one field changed value, but you get no visibility to which field.

Some modelers will lean towards 6NF data structures. This provides the ability to track change at the attribute level. But, it also produces significant complexity for loading the data and it is painful for any SQL Developer. As much as I love 6NF, I would not recommend it as a first choice. The complexity outweighs most of the benefits.

For SCD type 2, you will use two dates; start date and end date. You could use NULL for the end date for the most recent record, but a lot of people will use a date far into the future (12/31/2099). It makes SQL queries easier.

要查看或添加评论,请登录

Nigel Shaw的更多文章

  • Integer Bitmaps - Data Modeling

    Integer Bitmaps - Data Modeling

    One of the greatest qualities in data modeling is elegance - finding a solution that is beautiful and simple. Integer…

    2 条评论
  • Adventures In Data

    Adventures In Data

    I actually think it is harder now to be a data analyst - there are so many tools and possibilities. It must hard to…

  • Being A Great IT Manager

    Being A Great IT Manager

    You were selected to lead. Just being a manager barely meets any job requirement.

  • Data Quality - The Elephant In The Room

    Data Quality - The Elephant In The Room

    I think there is a general reticence to speak the truth; data quality issues are ultimately defects, and what we should…

  • No Room For Heroes

    No Room For Heroes

    When you work with data you need a real team with both depth and width. There is often a tendency to make your team…

  • Data Undressed

    Data Undressed

    Data at the source is wrapped in layers of metadata; definition, lineage, value, security and on and on. Once we start…

  • Noor Inayat Khan

    Noor Inayat Khan

    If you ever visit Dachau, you will see a simple memorial plaque in the Memorial Hall for Noor Inayat Khan. She is one…

  • The Eternal Battle - Data vs App

    The Eternal Battle - Data vs App

    If you look at the root of most data quality issues it is normally embedded within apps - either poor data design (how…

    1 条评论
  • Dashboard Trust Transparency

    Dashboard Trust Transparency

    When you publish a dashboard to your stakeholders you are making a commitment to them to provide timely and accurate…

  • Domains, Lists And Data

    Domains, Lists And Data

    Mastering data is all about mastering the basics, and there is no better place to start than domains. A domain is…

社区洞察