登录查看更多内容

HOOK vs Data Vault: Modelling

Andrew Foad

发布日期: 2023年6月21日

Data Modelling; means many different things to many people. I can’t remember where the following quote came from, but it goes something like this:

If you ask ten different modellers to build a data model, you’ll get eleven different models.

I think the joke is that one of the modellers couldn’t even agree with themself. And it is so true. Data modelling is such an abstract and sometimes cerebral activity that it is almost impossible to build a model that everybody can agree upon; and especially true for data warehouse modelling when we develop incrementally, and the modeller doesn’t always have a complete view of the whole business landscape. We rarely end up with the same answer because we can’t follow a prescribed process. Ten modellers >> Eleven models.

The Scenario

I will recycle an example I used recently in my presentation at the Knowledge Gap data modelling conference earlier this year (https://youtu.be/H_GOXnW9fXY). However, in this article, I will go into more detail.

The model below might represent a healthcare provider such as an Optometry practice.?

A Customer may buy any number of Products (lenses, frames, contact lenses, cleaners, etc.) as part of a Sale. Optionally, the Customer might have made an Appointment for some treatment or a consultation, and a Sale Item may (or may not) be related to an Appointment event. I might purchase new lenses after an eye test appointment or buy new spectacle frames for those lenses without needing a consultation.

Let us assume that this model represents the source system from which we will load data into the data warehouse. There are some interesting aspects of the data that I want to highlight.

The design of the source database makes heavy use of surrogate keys as the primary keys. All fields suffixed with “_ID” are non-intelligent surrogate keys. Fields shown in bold and annotated with “PK” are part of the primary key for that table. Fields annotated with “FK” represent foreign keys referencing primary keys on related tables. The arrows represent relationships, and a dotted line indicates that the relationship is optional. Finally, fields annotated with “BK” are business keys, which are generally human-readable codes used within the business. Our example model has Customer_No and Product_Code annotated as business keys.

Regarding our core business concepts (CBCs), we have four: Customer, Sale, Appointment and Product. And to save space on the diagrams, I will omit any audit fields such as Load_Timestamp and Load_Source (unless needed to ensure the model makes sense). Just assume they are on each of the warehouse artefacts.

To break with the convention of my previous posts, I will start with how we would model this with HOOK.

HOOK Solution

With the HOOK approach, we analyse each source table to pick out any business keys. Or, more precisely, we pick out all the identifying keys; or even more precisely, we pick out all the identifying keys for business concepts. Here goes…

That’s it. It didn’t take much thought, but I’ll step you through it one table at a time.

Customer

On inspection, we have two keys on the table that identify a customer: Customer_ID, the surrogate primary key for the table; and Customer_No, the business key we refer to within the business whenever we correspond with/about a customer. Rather than picking just one of these keys, we can freely use both, aligning them to the Customer CBC by creating Hook Keys prefixed with “HK_CUSTOMER”. Of course, we can’t have two fields with the same name, so we qualify each name to make them unique. Again, by convention, we separate qualifiers using a double underscore “__”.

We must also qualify the key values themselves with a KeySet. Let’s label these KeySets as {CUST.ID} and {CUST.CD}. If we were to represent the Bag as a view, the definition would look something like this:

CREATE VIEW [BAG CUSTOMER]
AS
SELECT HK_CUSTOMER__ID ??= CONCAT(‘CUST.ID|’, Customer_ID)
????, HK_CUSTOMER__Code = CONCAT(‘CUST.CD|’, Customer_Code)
????, *
?FROM CUSTOMER

You can see how the Hook Keys are formed by concatenating the KeySet with the relevant business keys.

Product

The model for Product mirrors what I just showed for Customer. Again we have two keys identifying the same CBC.

CREATE VIEW [BAG PRODUCT]
AS
SELECT HK_PRODUCT__ID??= CONCAT(‘PROD.ID|’, Product_ID)
????, HK_PRODUCT__Code = CONCAT(‘PROD.CD|’, Product_Code)
????, *
?FROM PRODUCT

We use KeySets {PROD.ID} and {PROD.CD} to qualify their respective business keys within the Hook Key.

Sale

The process is precisely the same for the Sale table. We look for business key identifiers and project them out as Hook Keys. In this case, we have two identifiers, Sale_ID and Customer_ID. Again we need to assign a KeySet to each. Let’s say the Sale_ID is part of the KeySet labelled {SALE.ID}, and the Customer_ID as a foreign key to the Customer table must have a KeySet or {CUST.ID}.

CREATE VIEW [BAG SALE]
AS
SELECT HK_SALE? ? = CONCAT(‘SALE.ID|’, Sale_ID)
? ? , HK_CUSTOMER = CONCAT(‘CUST.ID|’, Customer_ID)
? ? , *
?FROM SALE

In this case, there was no need to qualify the Hook Key names, but you would be free to do so if you feel it clarifies the model.

?Appointment

We will build the Appointment Bag in the same way as we did for the Sale Bag. We have a new business key Appointment_ID which identifies the Appointment CBC and assign the Keyset {APPT.ID}.

The Bag view looks like this.

CREATE VIEW [BAG APPOINTMENT]
AS
SELECT HK_APPOINTMENT = CONCAT(‘APPT.ID|’, Appointment_ID)
????, HK_CUSTOMER???= CONCAT(‘CUST.ID|’, Customer_ID)
????, *
?FROM SALE

Sale Item

The Sale Item table is a little bit different. It does not contain an identifying key for the Sale Item CBC because there is no CBC for Sale Item. However, the Sale Item table does have business key identifiers for Sale, Appointment and Product CBCs.

CREATE VIEW [BAG SALE ITEM]
AS
SELECT HK_SALE???????= CONCAT(‘SALE.ID|’, Sale_ID)
????, HK_PRODUCT????= CONCAT(‘PROD.ID|’, Product_ID)
????, HK_APPOINTMENT = CONCAT(‘APPT.ID|’, Appointment_ID)
????, *
?FROM [SALE ITEM]

You might have noticed that I drew no relationships in the model. I could have, but it isn’t necessary. We can, however, use a simple and powerful way to represent all relationships in the model with a Kimball-style BUS matrix.

Down the lefthand-side, we have our list of Bags. Across the top, we have our CBCs. The intersecting cells indicate which CBCs each Bag references and which KeySets we used.

We can tell how each of these Bags is related to one another at a glance. For example, I can see that I can join the Appointment and Customer bags because they both contain Hook Keys that use the same KeySet {CUST.ID}. I can also see that I can join the Sale Item Bag to the Appointment Bag {APPT.ID}, to the Product Bag {PROD.ID}, and to the Sale Bag {SALE.ID}.

Data Vault Solution

OK folks, strap in; it’s Data Vault time. It will be a long journey…

The first step in the modelling process is to identify the Hubs. As we have already established, we have four CBCs (Customer, Appointment, Sale and Product), which we can equate to four Hubs. Little or no thought is required.

That’s not entirely true. Depending upon which modelling camp you sit 
within, the definition of a Hub isn’t always clear-cut.

If you use a “traditional” Data Vault modelling approach as the Data Vault
Alliance prescribes, we have the four Hubs that I listed. If we use an 
Ensemble modelling approach, then, in addition, we would create a Hub 
representing the Sale Item.

I will discuss this in a little more detail later in this article.

The problems come when we look at modelling Links. Even with this relatively simple model, there are many, many ways we might choose to model this. As the number of Hubs grows, the potential number of possible Links grows exponentially. For a Data Vault model with H hubs, there are in the order of 2H possible Links. Consider a model with 100 Hubs; then, there are around 1,300,000,000,000,000,000,000,000,000,000 possible permutations of Hubs that could exist as a Link. Of course, they can’t all exist. At most, there can only be as many links as we have ingested tables, but it should give you a sense that there are many potential ways we can model the same data.

领英推荐

Cracking the Code: How to Tell a Story with Your…

Quantum Analytics NG 1 年前

Data Strategy Essentials

Boston New Technology (BNT) 7 个月前

The importance of data visualization in Market Research

codeit 1 个月前

Looking at the source data model, we can pick out at least four links representing the business key associations within the data.

[Customer, Sale]
[Customer, Appointment]
[Sale, Product, Appointment]
[Sale, Product] – For when no Appointment is involved in a Sale Item

And, depending upon how strict we want to be about only using recognised business keys, we might want to include a couple of Same-As links to map the dual keys for Customer and Product. If we don’t use the Same-As Links, we will have to perform a look-up to obtain the proper business keys when ingesting data from tables that don’t contain the true business key. I discuss this in the article: “HOOK vs Data Vault : Pre-Joins for Business Keys”.

Customer

Typically, we will build the Data Vault model incrementally. I’ll start with the Customer data. As I said before, we will attempt to use the “true” business key when constructing the Data Vault artefacts.

The Customer data will translate into a Hub /Satellite pair as follows:

Notice how we have put the customer business key (Customer_No) in the Customer Hub. We place the remaining fields in the Customer Satellite. We also left the surrogate key (Customer_ID)?in the Satellite. If we had decided to pull out both keys, then we would probably have to model this using a Same-As link as follows:

The Same-As Link (SAL) refers to the Customer Hub twice, once for each Customer identifying key (Customer_ID and Customer_No).

Whether the choice/judgement you made was correct will only become apparent as we add to the model.

Product

We can model Product in the same way, and we have the same choice to make:

Which choice is better? Only time will tell, but it is another decision we, as modellers, have to make.

Appointment

Next, we’ll look at the Appointment data and again find we have choices to make. We’ve already established that we have a CBC called Appointment. Therefore, we’ll have an Appointment Hub, and it would make sense to hang the Appointment data as a Satellite from this Hub.

However, we still need to record the association/relationship between the Appointment and Customer, which we represent as a Link. The model might look something like this:

On the face of it, this seems OK, but we have lost something. Can we answer the question, “When did the relationship between a Customer and an Appointment form?” And what happens if an Appointment is changed and assigned to a different Customer? Would we know which Customer is currently associated with the Appointment? The relationship between Appointment and Customer lacks any context.

We could add the context by adding another Satellite that hangs off the Appointment Link. Or we could hang the Appointment Satellite off the Appointment Link rather than the Appointment Hub.

Which way is best? Ask ten modellers and see what answers you get.

Sale

We have the same decision to make for the Sale data. Which of these do we pick?

Sale Item

With Sale Item, we step things up a notch in complexity. We now have three business keys and, therefore, three Hubs to play with: Sale, Product and Appointment. With three Hubs, there are at least four possible Links:

[Sale, Product]
[Sale, Appointment]
[Product, Appointment]
[Sale, Product, Appointment]

The simplest model, at least in my mind, is to go with the three-way Link as follows:

However, you could argue that because the association with Appointment is optional, the “true” relationship is between Sale and Product. You may be familiar with terms like “Semantic Grain” and “Driving Keys”, which you must be aware of when building a Data Vault model. Taking that into account, the model would look more like this:

Semantically, this model is more accurate than the first version. The relationship between Appointment and Sale probably has nothing to do with an individual line item within a Sale. I say probably because the data in the source doesn’t explicitly tell us that. It’s an assumption that I’m making based on common sense, but I could be completely wrong. The point is that I have no confidence in which representation to go with. So, pick one and hope for the best.

?Unfortunately, we’re still not entirely done. As I mentioned earlier, we might take a different approach if we used Ensemble Modelling. The model is organised around Ensembles with this approach, as shown below.

The diagram shows Ensembles as the boundaries drawn by the three circles with dotted lines. At the centre of each Ensemble is a Hub, and then Ensembles are connected via Links that sit on the shared?Ensemble boundaries. Contextual information is attached to each Hub in the form of Satellites. Using this approach, we must represent Sale Item as a Hub. The model now looks very different.

Again, I’m not entirely sure I’ve modelled this correctly. I think much of the model is open to some degree of debate which is, in itself, a bit of a problem. Choice and yet more choices. Therefore, less confidence that we got the model right.

I’m not going to attempt to draw the whole diagram that covers all five source tables (Customer, Appointment, Sale, Sale Item and Product); it would be far too complicated to show, and it would only represent one possible option from the many design choices I had to make. And I can guarantee that whatever I did, 90% of you reading this would disagree with it.

Analysis

As we have seen, there is a very low burden on modelling in a HOOK implementation. You don’t have to think too much about how to represent the data. You simply pick out the identifying keys for any core business concepts within each source table and project them out as Hook Keys (making sure to qualify them with the appropriate KeySet). You don’t even need to think about the relationships. There really is only one way to model each HOOK artefact. In a sense, HOOK is data modelling “lite”. From a business perspective, the extent of the modelling is identifying the CBC in the first place.

Conversely, using Data Vault, we had many choices to make. And even after we have made those choices, we have low confidence that we got it right, which probably means that in the future, we’ll have to make changes. In other words, we are almost guaranteed to create technical debt when building a Data Vault warehouse, especially if we build it incrementally.

Conclusions

I’ll leave the conclusions to you, but I hope I have shown that with HOOK, the modelling process is almost trivial, especially when compared to an equivalent Data Vault model. With HOOK, there are few, if any, modelling decisions to make. I am confident that if you asked those ten data modellers for a HOOK data model, you’d only get one model back (well, maybe two ??).

Remco Broekmans

VP International Programs at Genesee Academy, LLC

1 年

Hi Andrew, Nice write up and provides good insight in the HOOK approach. I love your use of the matrix to captur the relationships between the bags., However I do not agree with your DV - model at some points. The main one is your example of the SAL! In your solution of the SAL you more or less are using the Product_ID (alternative identifier) within the SAL - wghereas the SAL is a reference between the PK's of a Hub (that you canb use the logic is of course an option to provide you this information). More importantly the use of the SAL doesn't move your Satellite from the Hub to the SAL. It will just stay on the Hub - it describes the business identifier you store in the Hub. See picture for this part of the model! And yes, we 2 modellers already came now with 3 different solutions ;-D.

1 次回应

David Jaques-Watson

Senior Data Modeller at NCS Australia

1 年

I used to tell you if you have three data modellers in a room you'll get five different models - but I probably stole it from someone! Maybe Graeme Simsion, perhaps. (Always steal from the best!)

5 次回应

Uli Bethke

Follow me for SQL Data Pipelines, Snowflake, Data Engineering, XML Conversion

1 年

Should there be a relationship between Bag Sale and Customer in your matrix?

1 次回应

Brand

1 年

Thanks for Sharing! ?? Andrew Foad

1 次回应

查看更多评论

要查看或添加评论，请登录

Andrew Foad的更多文章

Is Data Modelling Dead?

2025年2月24日

Is Data Modelling Dead?

I’ve been using data modelling techniques for most of my 35-year career, and I consider it an essential skill for…

37 条评论
Data Vault is Garbage (change my mind)

2025年1月20日

Data Vault is Garbage (change my mind)

If you are unfamiliar with the Hans Christian Andersen story of “The Emperor’s New Clothes”, ChatGPT summarises it…

7 条评论
Introducing HOOK. Part 1 - Logical Architecture

2024年1月21日

Introducing HOOK. Part 1 - Logical Architecture

This article is the first in a short series on the Hook approach to Data Warehousing. As I said in my introduction…
Is “Agile Data” an Oxymoron?

2023年10月15日

Is “Agile Data” an Oxymoron?

In last week’s article, I discussed how you can “boil the ocean” when building a HOOK data warehouse. The article…

11 条评论
Always try to boil the ocean!

2023年10月8日

Always try to boil the ocean!

Recently, I have felt very left out by all these Snowflake folk with their superhero avatars. As I don’t work for…

16 条评论
HOOK and the Unified Star Schema

2023年10月3日

HOOK and the Unified Star Schema

First, we should probably understand what a Unified Start Schema is before we start talking about how we can use it in…

6 条评论
HOOK vs Data Vault: Willibald Part 6

2023年9月17日

HOOK vs Data Vault: Willibald Part 6

As a reminder, the diagram below represents the Willibald source database. It consists of eleven tables.
HOOK vs Data Vault: Willibald Part 5

2023年9月8日

HOOK vs Data Vault: Willibald Part 5

As a reminder, the diagram below represents the Willibald source database. It consists of eleven tables.

10 条评论
The Attempted Assassination of ELM

2023年8月28日

The Attempted Assassination of ELM

Around the middle of the 19th Century, the students of a public boy’s school in the English Midlands invented a game. A…

13 条评论
HOOK vs Data Vault: Willibald Part 4

2023年8月7日

HOOK vs Data Vault: Willibald Part 4

As a reminder, the diagram below represents the Willibald source database. It consists of eleven tables.

5 条评论

See all articles

HOOK vs Data Vault: Modelling

Andrew Foad

The Scenario

HOOK Solution

Data Vault Solution

领英推荐

Analysis

Conclusions

Andrew Foad的更多文章

社区洞察

其他会员也浏览了

Demystifying Dimensionality Reduction: Simplifying Complex Data Analysis

Helpful Tips on How to Turn Data and Information into Solutions

What is Data Normalization? And Why Is It Important To Do Before Data Visualization?

11 Essential Plots That Data Scientists Use 95% of the Time

Mastering Data Visualization: Striking the Right Balance Between Statistical Significance & Visual Appeal ???

Transform → Visualize → Model for Sequence Analytics

The Future of Data Visualization: Predictions for 2019 and Beyond

Model Evaluation Metrics in Data Science

Unlocking the Power of Time Series Data: How Low-Code Platforms are Simplifying Predictive Analytics

Unlocking Data Insights with Principal Component Analysis (PCA)

The Scenario

HOOK Solution

Data Vault Solution

领英推荐

Analysis

Conclusions

Andrew Foad的更多文章

Is Data Modelling Dead?

Data Vault is Garbage (change my mind)

Introducing HOOK. Part 1 - Logical Architecture

Is “Agile Data” an Oxymoron?

Always try to boil the ocean!

HOOK and the Unified Star Schema

HOOK vs Data Vault: Willibald Part 6

HOOK vs Data Vault: Willibald Part 5

The Attempted Assassination of ELM

HOOK vs Data Vault: Willibald Part 4

社区洞察

其他会员也浏览了

Demystifying Dimensionality Reduction: Simplifying Complex Data Analysis

Helpful Tips on How to Turn Data and Information into Solutions

What is Data Normalization? And Why Is It Important To Do Before Data Visualization?

11 Essential Plots That Data Scientists Use 95% of the Time

Mastering Data Visualization: Striking the Right Balance Between Statistical Significance & Visual Appeal ???

Transform → Visualize → Model for Sequence Analytics

The Future of Data Visualization: Predictions for 2019 and Beyond

Model Evaluation Metrics in Data Science

Unlocking the Power of Time Series Data: How Low-Code Platforms are Simplifying Predictive Analytics

Unlocking Data Insights with Principal Component Analysis (PCA)