Snowflakes and why not to use them (1/4): The Basics
Petr Podrouzek
Global Tech Leader | SVP at Emplifi | Strategy & Engineering Excellence
Stars and snowflakes
There are two basic ways to model data for reporting and analytics: the star schema and the snowflake schema. The only difference is that in a snowflake schema dimensions can be linked to other dimensions (the dimensions are normalized), while in a star schema every dimension is linked directly to the fact table. The diagrams below illustrate the difference:
Fig. 1: Snowflake schema (source: Wikipedia, Snowflake)
Fig. 2: Star schema (source: Wikipedia, Star)
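To make the structural difference concrete, here is a minimal T-SQL sketch of the same product dimension in both shapes. All table and column names are hypothetical:

-- Snowflake style: category attributes live in their own dimension,
-- and the product dimension references it.
CREATE TABLE dbo.DimCategory (
    CategoryKey  INT IDENTITY(1,1) PRIMARY KEY,
    CategoryName NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.DimProductSnowflake (
    ProductKey   INT IDENTITY(1,1) PRIMARY KEY,
    ProductName  NVARCHAR(100) NOT NULL,
    CategoryKey  INT NOT NULL REFERENCES dbo.DimCategory (CategoryKey)
);

-- Star style: the category attributes are denormalized straight
-- into the product dimension, so facts are always one join away.
CREATE TABLE dbo.DimProductStar (
    ProductKey   INT IDENTITY(1,1) PRIMARY KEY,
    ProductName  NVARCHAR(100) NOT NULL,
    CategoryName NVARCHAR(50) NOT NULL
);

In the snowflake variant a report joins the fact table to products and products to categories; in the star variant a single join suffices, at the cost of repeating the category name on every product row.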
Slowly changing dimension
Most data warehouses keep a history of changing data, meaning you want to track how dimensional data change over time. For example, if a customer changes their address, you want to keep both records: the one before the change and the one after. Although I have worked on data warehouses where this was not implemented and dimensions were always completely reloaded, most of the implementations I was involved in did contain slowly changing dimensions (SCD), most often type 2 (SCD2). I will briefly explain SCD2 but will not elaborate on the other types; there is plenty of material on them out there:
- SCD2 versions data using two timestamp columns (ValidFrom and ValidTo)
- The current record has ValidTo = NULL
- Once a record is updated or deleted, ValidTo is set to the time of the update or deletion; if the source system does not supply this time, the time of the ETL load is used (a simple GETDATE()).
- If you load round the clock, the precision of this timestamp may need to be up to seconds or finer.
- I strongly suggest implementing a hash to identify changes. The hash is calculated over all columns except the business key (the key can be included as well, but since rows are already matched on it, it adds nothing). The whole data flow can be described with the following diagram; a T-SQL sketch of the same flow follows it:
Fig. 3: Data flow of SCD2 with hashes
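Below is a minimal T-SQL sketch of this flow, assuming a full extract sits in a staging table. The dbo.DimCustomer and stg.Customer names, the tracked columns, and the MD5 choice are all hypothetical, and the '|' delimiter in CONCAT assumes no column value contains that character:

-- Hypothetical SCD2 customer dimension: surrogate key, business key,
-- tracked attributes, change-detection hash, and the ValidFrom/ValidTo pair.
CREATE TABLE dbo.DimCustomer (
    CustomerSK  INT IDENTITY(1,1) PRIMARY KEY,
    CustomerBK  INT            NOT NULL,   -- business key from the source system
    Name        NVARCHAR(100)  NOT NULL,
    Address     NVARCHAR(200)  NOT NULL,
    RowHash     VARBINARY(16)  NOT NULL,   -- MD5 over the non-key columns
    ValidFrom   DATETIME2(0)   NOT NULL,
    ValidTo     DATETIME2(0)   NULL        -- NULL = current version
);

DECLARE @LoadTime DATETIME2(0) = SYSDATETIME();

-- 1) Close the current version of rows whose attributes changed
--    (one hash comparison replaces a column-by-column comparison).
UPDATE d
SET    d.ValidTo = @LoadTime
FROM   dbo.DimCustomer AS d
JOIN   stg.Customer    AS s ON s.CustomerBK = d.CustomerBK
WHERE  d.ValidTo IS NULL
  AND  d.RowHash <> HASHBYTES('MD5', CONCAT(s.Name, '|', s.Address));

-- 2) Close the current version of rows deleted from the source
--    (the business key no longer appears in the full staging extract).
UPDATE d
SET    d.ValidTo = @LoadTime
FROM   dbo.DimCustomer AS d
WHERE  d.ValidTo IS NULL
  AND  NOT EXISTS (SELECT 1 FROM stg.Customer AS s
                   WHERE s.CustomerBK = d.CustomerBK);

-- 3) Insert a fresh current version for new and changed business keys
--    (changed keys have no current row any more after step 1).
INSERT INTO dbo.DimCustomer (CustomerBK, Name, Address, RowHash, ValidFrom, ValidTo)
SELECT s.CustomerBK, s.Name, s.Address,
       HASHBYTES('MD5', CONCAT(s.Name, '|', s.Address)),
       @LoadTime, NULL
FROM   stg.Customer AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d
                   WHERE d.CustomerBK = s.CustomerBK
                     AND d.ValidTo IS NULL);

Storing RowHash on the dimension means each load hashes only the incoming staging rows and compares two small binary values per record, instead of comparing every tracked column one by one.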
A few last words...
In this text I have explained the basic terms I will be using in the next articles: star and snowflake schemas, and slowly changing dimensions type 2. In the next article I will implement them in MS SQL 2012. What about you? Do you use snowflakes or stars? Do you use SCDs?
Sources:
- Wikipedia, Star. (n.d.). Retrieved June 14, 2016, from https://en.wikipedia.org/wiki/Star_schema
- Wikipedia, Snowflake. (n.d.). Retrieved June 14, 2016, from https://en.wikipedia.org/wiki/Snowflake_schema
My previous technical articles:
- Agile BI Series
- FLOGS: Agile SSIS Series
Comments:
BI Engineer Azure Cloud / Power BI at Rabobank Netherlands (8 years ago): Hi Petr, how do you implement the hash in a DB2 environment?
Retiree at University of Life (8 years ago): Snowflake brings the worst features of both star and relational: it guarantees the integrity of the data structures at the cost of performance on big queries. No one solution is ever perfect in every way.
Data Architect (8 years ago): The problem with the snowflake schema is performance: the more joins, the less performance. This is why we try to avoid them if we can. There is no shame in denormalizing a data warehouse, quite the contrary; the shame is in slow reports! Also, I like your diagram. I have used this technique not for SCD2 but for Change Data Capture (differential loads of big tables). Same principles. But you explain it neatly.
CEO at ieSoft (8 years ago): I prefer the star schema overall; I am not sure when I last used a snowflake schema. I have used SCD2 in about 50% of my projects, depending on the requirements. The hash approach has always been the best way for me to detect changed data.
Sudar.io (8 years ago): Nice article. I agree this would be a great SCD2 load strategy for many use cases. In most cases I have seen, the source for the SCD2 dimension is more than one table, with different columns of the dimension coming from different source tables. In that case this strategy would rely on a staging table with columns matching the target data. The overhead of managing the staging table this way needs to be weighed against the efficiency of this process. Also, when we get delta feeds from the source(s), the 'Delete record' step will need a different route.