Data Tips #12 - Managing Facts data
Olof Granberg
Today we will discuss the handling of Facts data.
Facts, as we have discussed before, are business events: sales, stock movements, orders, clicks, etc.
We will use these facts in several ways, including (but not limited to):
As we can see from the above scenarios, the requirements on the data differ significantly when it comes to:
To cover all use cases we have to meet both low-latency requirements and give a complete picture of the historical facts.
There are several ways of handling this:
I personally favour the Separate after ingestion pattern, mainly because it does not hamper the streaming case and it lets you handle the tricky business logic and data logic more easily. My second choice would be to stream everything.
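As a rough illustration of the separate-after-ingestion idea, here is a minimal Python sketch. The handler names and the in-memory "stores" are placeholders, not a real streaming framework: every event is ingested once and then fanned out to a low-latency view and an append-only historical store.

```python
# Minimal sketch of the "separate after ingestion" pattern: every event is
# ingested once, then fanned out to a low-latency path (dashboards, alerts)
# and an append-only historical store (reporting, full reloads).
# Handler names and the in-memory stores are hypothetical placeholders.

from datetime import datetime, timezone

historical_store = []   # stand-in for a data lake / warehouse table
realtime_view = {}      # stand-in for a low-latency serving layer


def ingest(event: dict) -> None:
    """Single entry point: tag the event, then route it to both paths."""
    event["ingested_at"] = datetime.now(timezone.utc).isoformat()
    publish_realtime(event)     # low-latency path, no heavy logic here
    append_historical(event)    # complete history, business logic applied later


def publish_realtime(event: dict) -> None:
    # Keep only what the streaming consumers need, e.g. the latest status per key.
    realtime_view[event["order_id"]] = event["status"]


def append_historical(event: dict) -> None:
    # Append as-is; cleansing and tricky business rules run in batch afterwards.
    historical_store.append(event)


ingest({"order_id": "A-1001", "status": "created", "amount": 120.0})
ingest({"order_id": "A-1001", "status": "paid", "amount": 120.0})
print(realtime_view)          # {'A-1001': 'paid'}
print(len(historical_store))  # 2, every event kept for the full picture
```

The point of the split is that the streaming path stays thin and fast, while the historical path can afford heavier, slower transformations.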
What about the dimensions?
So, when the event data has reached a certain point in your data platform, you want to enrich it with the dimensional data. The traditional way of handling this is to use surrogate keys, which give you a direct way of joining facts to dimensions: the original ID is swapped for the surrogate key. This can, however, be both compute-intensive and complex, so depending on the use case it can be better to simply keep the original IDs and handle the joins with those keys instead. For very fast querying, though, it makes sense to use simple key-to-key joins.
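To make the trade-off concrete, here is a small pandas sketch (column names such as item_sk and item_id are illustrative, not from the original post) showing both strategies: swapping the natural ID for a surrogate key at load time versus keeping the original ID and joining on it at query time.

```python
# Two strategies for joining facts to dimensions (illustrative column names).
import pandas as pd

dim_item = pd.DataFrame({
    "item_sk": [1, 2],                  # surrogate key
    "item_id": ["SKU-10", "SKU-20"],    # natural / source-system key
    "item_name": ["Hammer", "Nails"],
})

fact_sales = pd.DataFrame({
    "item_id": ["SKU-10", "SKU-20", "SKU-10"],
    "quantity": [3, 10, 1],
})

# Strategy 1: resolve the surrogate key during fact loading (extra compute and
# lookup logic up front, but very cheap key <-> key joins afterwards).
fact_with_sk = (
    fact_sales
    .merge(dim_item[["item_sk", "item_id"]], on="item_id")
    .drop(columns=["item_id"])
)

# Strategy 2: keep the natural key and join on it when querying (simpler
# pipeline, with the join cost paid at read time instead).
report = fact_sales.merge(dim_item, on="item_id")[["item_name", "quantity"]]
print(report)
```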
Early-arriving facts is the term for events that occur before the dimensional data has arrived; for example, an item is sold before its item description has reached the data platform. In that case it is easiest to create a dummy dimension row for that ID and update it once the dimensional data arrives. Do not replace the IDs in the fact table with "unknown" or similar; that will destroy the data.
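A minimal sketch of that approach, again with illustrative pandas tables: IDs that appear in the facts but not yet in the dimension get a placeholder row, which is later updated in place once the real dimensional data arrives. The fact table is never touched.

```python
# Handling early-arriving facts with a placeholder dimension row.
import pandas as pd

dim_item = pd.DataFrame({
    "item_id": ["SKU-10"],
    "item_name": ["Hammer"],
})

fact_sales = pd.DataFrame({
    "item_id": ["SKU-10", "SKU-99"],   # SKU-99 sold before its master data arrived
    "quantity": [3, 5],
})

# Create dummy dimension rows for IDs that only exist in the facts.
missing_ids = set(fact_sales["item_id"]) - set(dim_item["item_id"])
placeholders = pd.DataFrame({
    "item_id": sorted(missing_ids),
    "item_name": ["<awaiting master data>"] * len(missing_ids),
})
dim_item = pd.concat([dim_item, placeholders], ignore_index=True)

# Later, when the dimensional data arrives, update the placeholder in place.
dim_item.loc[dim_item["item_id"] == "SKU-99", "item_name"] = "Screwdriver"

# The fact table kept its original ID, so the join is always complete.
print(fact_sales.merge(dim_item, on="item_id"))
```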
Aggregates
Aggregates can be important, especially for lowering query time and compute cost. They should be used whenever you see that the granular data is frequently summed up to day/week/year or along some other dimension.
They should, however, be created when needed rather than for every eventuality, since if the underlying data changes you will need to reload the aggregates.
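As an example of the kind of aggregate meant here, the sketch below (illustrative pandas, hypothetical column names) rolls granular sales up to one row per store and day. Since it is derived data, it has to be rebuilt whenever the underlying facts are restated, which is exactly why you only create the aggregates that queries actually need.

```python
# Building a daily aggregate from granular facts (illustrative names).
import pandas as pd

fact_sales = pd.DataFrame({
    "sold_at": pd.to_datetime([
        "2024-05-01 09:15", "2024-05-01 17:40", "2024-05-02 11:05",
    ]),
    "store_id": ["S1", "S1", "S2"],
    "amount": [120.0, 80.0, 200.0],
})

# Aggregate to one row per store and day to cut query time and compute cost.
daily_sales = (
    fact_sales
    .assign(sales_date=fact_sales["sold_at"].dt.date)
    .groupby(["store_id", "sales_date"], as_index=False)
    .agg(total_amount=("amount", "sum"), row_count=("amount", "size"))
)
print(daily_sales)
```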
That is it for today. In a later post we will look at how to drive your data pipelines based on the data received rather than the processing date (this is important).