Using Virtual Data Marts the right way
Uli Bethke
Follow me for SQL Data Pipelines, Snowflake, Data Engineering, XML Conversion
Virtual data marts can be a useful design pattern but there are a few things you should know before you use them.
Virtual data marts are logical views modeled dimensionally on top of an integration layer or a persisted staging area.
Don't confuse them with data virtualisation or data federation. Virtual data marts are built against a single data store (data warehouse platform). Data virtualisation is a data federation technique across multiple data stores that can be useful in certain edge cases (a topic for another day).
You can mix virtual marts with physical marts, i.e. you can set down the data physically for some fact or dimension tables but not for others.
The advantage of virtual data marts is that they will save you in development time and also when making changes (not Database Change Management).
The main downside is that they tend to perform less well than physical tables in particular for larger data volumes or complex logic.
In certain cases such as snapshots your only option may be to set down the data physically.
There are some other disadvantages of virtual data marts.
- You need to create OUTER JOINS from your virtual facts to the virtual dimension tables. You can't use surrogate keys. Also hashes are prohibitive due to the huge overhead in compute costs.
- People find it hard to get used to virtual data marts. Don't underestimate that factor.
- BI tools often require a surrogate key to work well with dimensional models.
I recommend using virtual data marts as the default approach in particular for the first iteration of your dimensional model. You need to have an approach that allows you to quickly turn a virtual data mart into a physical table. Data warehouse automation works well and it just takes minutes to swap out the virtual mart for a physical table, e.g. we have developed such automation on top of Airflow.
Snowflake Certified Data Engineer & BI Solutions Architect
3 年Hi Uli. Great article. So, we are using the Snowflake cloud database (columnar), which from a performance standpoint, does not care about 3NF vs Star vs "Virtual Data Mart". Also, let's say that I do actually build physical dimension tables, which have surrogate keys, start and end dates. etc (SCDs). At this point I now have a hybrid model. I have a 3NF model + a set of Dimension tables. Building on that, I can create views on top of my (true and perfect) 3NF data model which, based on transaction dates in the core tables, will indeed pull the correct surrogate keys out of my dimensions. At this point, I now have any easy to use "Virtual Data Mart" for all consumers. With this approach, I can now easily model into a star schema for any BI tool. Additionally, this way, I do not have to build any additional ELT into "stovepipe" physical data mart fact tables. I can also answer any question that the business can ask, and with Snowflake, get the same performance as I would have with physical stars. I see no downside with this approach. What do you think? Dave