Data Tips #12 - Managing Facts data
Olof Granberg
Today we will discuss the handling of Facts data.
Facts, as we have discussed before, are business events: sales, stock movements, orders, clicks, etc.
We will use these facts in several ways, including (but not limited to):
As we can see from the above scenarios, the requirements on the data differ significantly when it comes to:
To cover all use cases we have to meet both low-latency requirements and give a complete picture of the historical facts.
There are several ways of handling this:
I personally favour the Separate after ingestion pattern, mainly because it does not hamper the streaming case and it lets you handle the tricky business logic and data logic more easily. My second choice would be to stream everything.
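As a rough illustration of the separate-after-ingestion idea, here is a minimal Python sketch. The handler names and the in-memory "stores" are placeholders, not a real streaming framework: every event is ingested once and then fanned out to a low-latency view and an append-only historical store.

```python
# Minimal sketch of the "separate after ingestion" pattern: every event is
# ingested once, then fanned out to a low-latency path (dashboards, alerts)
# and an append-only historical store (reporting, full reloads).
# Handler names and the in-memory stores are hypothetical placeholders.

from datetime import datetime, timezone

historical_store = []   # stand-in for a data lake / warehouse table
realtime_view = {}      # stand-in for a low-latency serving layer


def ingest(event: dict) -> None:
    """Single entry point: tag the event, then route it to both paths."""
    event["ingested_at"] = datetime.now(timezone.utc).isoformat()
    publish_realtime(event)     # low-latency path, no heavy logic here
    append_historical(event)    # complete history, business logic applied later


def publish_realtime(event: dict) -> None:
    # Keep only what the streaming consumers need, e.g. the latest status per key.
    realtime_view[event["order_id"]] = event["status"]


def append_historical(event: dict) -> None:
    # Append as-is; cleansing and tricky business rules run in batch afterwards.
    historical_store.append(event)


ingest({"order_id": "A-1001", "status": "created", "amount": 120.0})
ingest({"order_id": "A-1001", "status": "paid", "amount": 120.0})
print(realtime_view)          # {'A-1001': 'paid'}
print(len(historical_store))  # 2, every event kept for the full picture
```

The point of the split is that the streaming path stays thin and fast, while the historical path can afford heavier, slower transformations.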
What about the dimensions?
So, when the event data has reached a certain point in your data platform, you want to enrich it with the dimensional data. The traditional way of handling this is to use surrogate keys, which give you a direct way of joining facts to dimensions: the original ID is swapped for the surrogate key. This can, however, be both compute-intensive and complex, so depending on the use case it can be better to simply keep the original IDs and handle the joins with those keys instead. For very fast querying, though, it makes sense to use simple key-to-key joins.
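To make the trade-off concrete, here is a small pandas sketch (column names such as item_sk and item_id are illustrative, not from the original post) showing both strategies: swapping the natural ID for a surrogate key at load time versus keeping the original ID and joining on it at query time.

```python
# Two strategies for joining facts to dimensions (illustrative column names).
import pandas as pd

dim_item = pd.DataFrame({
    "item_sk": [1, 2],                  # surrogate key
    "item_id": ["SKU-10", "SKU-20"],    # natural / source-system key
    "item_name": ["Hammer", "Nails"],
})

fact_sales = pd.DataFrame({
    "item_id": ["SKU-10", "SKU-20", "SKU-10"],
    "quantity": [3, 10, 1],
})

# Strategy 1: resolve the surrogate key during fact loading (extra compute and
# lookup logic up front, but very cheap key <-> key joins afterwards).
fact_with_sk = (
    fact_sales
    .merge(dim_item[["item_sk", "item_id"]], on="item_id")
    .drop(columns=["item_id"])
)

# Strategy 2: keep the natural key and join on it when querying (simpler
# pipeline, with the join cost paid at read time instead).
report = fact_sales.merge(dim_item, on="item_id")[["item_name", "quantity"]]
print(report)
```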
Early-arriving facts is the term for events that occur before the dimensional data has arrived; for example, an item is sold before its item description has reached the data platform. In that case it is easiest to create a dummy dimension row for that ID and update it once the dimensional data arrives. Do not replace the IDs in the fact table with "unknown" or similar; that will destroy the data.
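A minimal sketch of that approach, again with illustrative pandas tables: IDs that appear in the facts but not yet in the dimension get a placeholder row, which is later updated in place once the real dimensional data arrives. The fact table is never touched.

```python
# Handling early-arriving facts with a placeholder dimension row.
import pandas as pd

dim_item = pd.DataFrame({
    "item_id": ["SKU-10"],
    "item_name": ["Hammer"],
})

fact_sales = pd.DataFrame({
    "item_id": ["SKU-10", "SKU-99"],   # SKU-99 sold before its master data arrived
    "quantity": [3, 5],
})

# Create dummy dimension rows for IDs that only exist in the facts.
missing_ids = set(fact_sales["item_id"]) - set(dim_item["item_id"])
placeholders = pd.DataFrame({
    "item_id": sorted(missing_ids),
    "item_name": ["<awaiting master data>"] * len(missing_ids),
})
dim_item = pd.concat([dim_item, placeholders], ignore_index=True)

# Later, when the dimensional data arrives, update the placeholder in place.
dim_item.loc[dim_item["item_id"] == "SKU-99", "item_name"] = "Screwdriver"

# The fact table kept its original ID, so the join is always complete.
print(fact_sales.merge(dim_item, on="item_id"))
```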
Aggregates
Aggregates can be important, especially for lowering query time and compute cost. They should be used whenever you see that the granular data is frequently summed up to day/week/year or along some other dimension.
They should, however, be created when needed rather than for every eventuality, since if the underlying data changes you will need to reload the aggregates.
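As an example of the kind of aggregate meant here, the sketch below (illustrative pandas, hypothetical column names) rolls granular sales up to one row per store and day. Since it is derived data, it has to be rebuilt whenever the underlying facts are restated, which is exactly why you only create the aggregates that queries actually need.

```python
# Building a daily aggregate from granular facts (illustrative names).
import pandas as pd

fact_sales = pd.DataFrame({
    "sold_at": pd.to_datetime([
        "2024-05-01 09:15", "2024-05-01 17:40", "2024-05-02 11:05",
    ]),
    "store_id": ["S1", "S1", "S2"],
    "amount": [120.0, 80.0, 200.0],
})

# Aggregate to one row per store and day to cut query time and compute cost.
daily_sales = (
    fact_sales
    .assign(sales_date=fact_sales["sold_at"].dt.date)
    .groupby(["store_id", "sales_date"], as_index=False)
    .agg(total_amount=("amount", "sum"), row_count=("amount", "size"))
)
print(daily_sales)
```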
That is it for today. In a later post we will look at how to drive your data pipelines based on the data received rather than the processing date (this is important).