Snowflakes and why not to use them (2/4): Implementation
Petr Podrouzek
Global Tech Leader | SVP at Emplifi | Strategy & Engineering Excellence
In the previous article I defined what snowflake and star schemas are, and I explained the principle of a slowly changing dimension type 2 (SCD2). In this article I move from concepts to actual implementation. Imagine the following situation: you have a simple two-level hierarchy of categories (that is, categories and subcategories), and the OLTP source table could look like this:
Fig 4: Source system table (recursive hierarchy)
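To make the shape of such a source table concrete, here is a minimal sketch in Python. The column names (CategoryID, ParentCategoryID, Name) and the sample rows are illustrative assumptions, not the exact content of the figure:

```python
# Hypothetical rows from a recursive (self-referencing) category table.
# ParentCategoryID points back to the same table, so the tree depth is
# unbounded, even though this article assumes only two levels.
source_categories = [
    {"CategoryID": 1, "ParentCategoryID": None, "Name": "Bikes"},           # category
    {"CategoryID": 2, "ParentCategoryID": 1,    "Name": "Road Bikes"},      # subcategory
    {"CategoryID": 3, "ParentCategoryID": 1,    "Name": "Mountain Bikes"},  # subcategory
]
```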
As you can see, the table is implemented in a recursive manner, so the depth of the category tree is unbounded (although we assume only two levels here). This shape is not well suited to reporting tools, so such a hierarchy is normally "flattened out". If you choose a snowflake schema with two linked dimensions (DimCategory and DimSubcategory), it would look like this:
Fig 5: Snowflake dimensions
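As a purely illustrative sketch (the surrogate keys and SCD2 validity columns below are my assumptions, not the figure's exact layout), the two linked dimensions could be represented like this; note that the category attributes are stored only once and DimSubcategory merely references them:

```python
# Snowflake variant: DimSubcategory carries a foreign key (CategorySK) to
# DimCategory, so category attributes exist in a single place per version.
dim_category = [
    {"CategorySK": 10, "CategoryBK": 1, "Name": "Bikes",
     "ValidFrom": "2015-01-01", "ValidTo": None},
]

dim_subcategory = [
    {"SubcategorySK": 100, "SubcategoryBK": 2, "CategorySK": 10,
     "Name": "Road Bikes", "ValidFrom": "2015-01-01", "ValidTo": None},
    {"SubcategorySK": 101, "SubcategoryBK": 3, "CategorySK": 10,
     "Name": "Mountain Bikes", "ValidFrom": "2015-01-01", "ValidTo": None},
]
```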
Should you decide to go with the star schema, the “flattened out” hierarchy would most likely look like this:
Fig 6: Star dimension
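A matching sketch of the flattened star dimension (same caveat: invented column names and rows) shows the category attributes repeated on every subcategory row:

```python
# Star variant: each subcategory row also embeds its category's attributes,
# so "Bikes" is duplicated once per subcategory (and, later, per SCD2 version).
dim_subcategory_star = [
    {"SubcategorySK": 100, "SubcategoryBK": 2, "SubcategoryName": "Road Bikes",
     "CategoryBK": 1, "CategoryName": "Bikes",
     "ValidFrom": "2015-01-01", "ValidTo": None},
    {"SubcategorySK": 101, "SubcategoryBK": 3, "SubcategoryName": "Mountain Bikes",
     "CategoryBK": 1, "CategoryName": "Bikes",
     "ValidFrom": "2015-01-01", "ValidTo": None},
]
```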
As you can see, in the star schema the data is stored redundantly: for each subcategory, the category data is included as well. A category can have N subcategories, which means the category data is redundant N – 1 times in the star schema. Now assume that both dimensions use the slowly changing dimension type 2 (SCD2) concept. Also assume that while the categories are fairly stable (they do not change much), the subcategories change often (names, descriptions, etc.). This is a reasonable assumption because subcategories might have many more attributes (not just a business key, name and description as in the example above). In this case, the problem with data redundancy gets even worse. Why? Because SCD2 versions all records, so if there are M subcategory changes over a period of time, the category data ends up replicated N + M – 1 times, which is not very efficient. This inefficiency does not occur in the snowflake schema because the data is partly normalized.
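As a sanity check on that arithmetic, here is a tiny, purely illustrative Python simulation (the column names, sample values and dates are invented): with N = 3 subcategories and M = 4 subcategory changes, the star dimension ends up holding the category's data N + M = 7 times, i.e. N + M – 1 = 6 redundant copies, while a snowflake DimCategory would still hold it exactly once.

```python
from datetime import date

# Illustrative SCD2 simulation for a star-style subcategory dimension in which
# every row also embeds the (unchanged) category name.
N = 3  # subcategories under one category
M = 4  # subcategory attribute changes applied over time

# Initial load: one current row per subcategory, each repeating the category data.
star_dim = [
    {"SubcategoryBK": i, "SubcategoryName": f"Sub {i}", "CategoryName": "Bikes",
     "ValidFrom": date(2015, 1, 1), "ValidTo": None}
    for i in range(1, N + 1)
]

# Each change closes the current version and inserts a new one, which again
# duplicates the category data even though the category itself never changed.
for change in range(M):
    bk = (change % N) + 1  # pick a subcategory to change (round-robin)
    current = next(r for r in star_dim
                   if r["SubcategoryBK"] == bk and r["ValidTo"] is None)
    current["ValidTo"] = date(2016, 1, 1 + change)
    star_dim.append({**current,
                     "SubcategoryName": f"Sub {bk} (change {change + 1})",
                     "ValidFrom": date(2016, 1, 1 + change),
                     "ValidTo": None})

copies_of_category_data = len(star_dim)  # N + M rows, each carrying "Bikes"
print(copies_of_category_data - 1)       # N + M - 1 = 6 redundant copies
```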
A few last words...
In this article I implemented both snowflake and star schema SCD2 dimensions, and I showed that the star schema can be inefficient in its use of storage. So the natural answer would seem to be: let's use the snowflake. In the next article I will discuss whether this straightforward approach is actually a good one.
I agree with the comments about the choice of tech mattering a lot here. The Kimball methodology is deeply rooted in the assumption that one fact table can be efficiently inner-joined with several dimension tables in a single pass if the dimensions have been flattened (i.e., no snowflaking). While this was (and still is) true of many traditional RDBMSs like Oracle and MySQL, the assumption doesn't always hold with distributed map-reduce architectures. I'm not saying that everything about the old way is wrong (I'm actually a huge Kimball fan); just that each situation must be considered in more depth than it used to be.
I agree with the last comment regarding who the end user is and how the data will be used. Standard, canned reporting use cases can generally make use of this approach. Another factor to consider is the set of tools in the hands of the end users, their sophistication in using them, and the flexibility they require for analyzing the data from a different perspective or mashing it up against a different data model. In other words, if I know the questions to be asked, have business users with little SQL or technical background, and need to produce canned reports, then this approach is the traditional one. For cases where the business keeps asking new questions and the "grain" and dimensions change, I need an approach and tools that can derive the dimensions and create these associations on the fly. Regarding whether to snowflake or not, I would typically not spend time on it, since it adds a join that isn't needed. I would be more focused on data cleansing and conforming to the enterprise's definitions for the dimensions. Storage is cheap.
Big Data & Analytics - Data Solution Architect
Good article. I would say another factor to consider is the end user who is supposed to use this dimensional model. Non-technical users might prefer to keep it denormalized for ease of use.
Data Architect, Saxophonist
Also consider the likelihood that more levels will be needed, and whether the hierarchy will be unbalanced/ragged.
I make your data ready for AI (and humans, too)
Given that these types of dimensions are small, query performance is not a useful evaluation criterion: regardless of architecture, query performance will be high. SCD2 bridge tables, which link rows in the Category and Subcategory tables, make CRUD simple. Finally, by using metadata to define hierarchies, all CRUD operations for the SCD2 dimension hierarchy pattern can be hydrated automatically, which reduces TCO (total cost of ownership). Bottom line: thinking outside snowy stars lets you build a data warehouse that is fit for purpose, easily maintainable, low in cost, and highly performant.