HOOK vs Data Vault: Willibald Part 4

HOOK vs Data Vault: Willibald Part 4

As a reminder, the diagram below represents the Willibald source database. It consists of eleven tables.

No alt text provided for this image

This time we will be looking at how Willibald deals with Orders and the sale of their products.

In the first article in this series, you might remember that I talked about tweaking the original database design and making the OrderID column on both the Order and RoadshowOrder tables an auto-incrementing identity key. The idea is that there will be overlapping business keys across the two tables that refer to different orders. We must be able to distinguish between them.


Orders

Orders are the mechanism by which Willibald sells its products to its customers. There are two channels through which Willibald sell products: through an online web store; or direct to the public on their live Roadshow events. Each channel has its own tables in the Willibald system, highlighted in the model snippet below.

No alt text provided for this image

The RoadshowOrder table references Customer, Club and Product tables. The Order table only refers to Customer, and the OrderLine table hangs from it whilst referring to the Product table.

Which Core Business Concepts (CBCs) do we need to worry about? We already know about Club, Customer and Product, and now we have a new one called Order. We might ask whether we model Order as a single CBC or need two types: one for Web Store Order; and one for Roadshow orders. Well, if you even need to ask that question, I would suggest you go and read John Giles’ excellent The Elephant in the Fridge, where he describes some real-life examples that defy belief. In case you were wondering, it’s one CBC!

But what about OrderLine? The ELM approach would most definitely identify order lines as a distinct CBC; Data Vault, probably not; HOOK, almost certainly not.

Let’s start with Data Vault…


Data Vault

We know we have four CBCs and, therefore, four hubs. We have three source tables that we will need to map to some satellites. As there are multiple relationships, we also know we must implement some links.

The most straightforward modelling approach (at least, I think it is) is creating a satellite for each source table. If the satellite is associated with a single CBC, we hang it from the relevant hub; if the satellite is associated with more than one CBC, then we hang it from a suitably defined link — something like this.

No alt text provided for this image

This diagram represents just one solution, but there could be many others.

With four hubs, there are (a minimum of) eleven possible link structures. For example, should we model each binary relationship from the source as distinct links? We could debate the merits of each potential solution, but that would get boring really quickly.

And for the two Order tables, should we combine them into a single satellite? That might not be a great idea, but I’m sure somebody could make a good argument to do so.

The point is that there are numerous solutions and, therefore, many opportunities to get things wrong.


Ensemble Modelling (ELM)

As we’ve seen, the Ensemble modelling approach will yield a different structure as link satellites are not permitted. For simplicity, I’ve decided not to model any historical tracking of relationships:

No alt text provided for this image

As with the Data Vault solution, we might have fashioned many alternatives, but I have tried to show that both Web and Roadshow Orders are contained within the same Order Ensemble.

No alt text provided for this image

There are four other ensembles. We’ve already covered three of them, Customer, Club and Product, in previous articles; the fourth is Order Line.


HOOK

Let’s see what the equivalent HOOK model looks like.

No alt text provided for this image

Here we follow the same pattern as before, with a Bag defined for each source table. As always, it is a one-to-one. As with previous models, we can define the Bags as views over a data lake table (we can build them as physical tables if we choose to). They might look like this.

CREATE VIEW bag.Order
AS
SELECT HK_ORDER??? = CONCAT(‘ORD.WEB|’, OrderID)?
??? ?, HK_CUSTOMER = CONCAT(‘CUST|’, CustomerID)
?? ??, *
?FROM lake.Order        

And for the related order line information.

CREATE VIEW bag.OrderLine
AS
SELECT HK_ORDER?  = CONCAT(‘ORD.WEB|’, OrderID)?
? ?  , HK_PRODUCT = CONCAT(‘PROD|’, ProductID)
? ?  , *
?FROM lake.Order        

The Roadshow orders would look like this.

CREATE VIEW bag.RoadshowOrder
AS
SELECT HK_ORDER???= CONCAT(‘ORD.RS|’, OrderID)?
????, HK_CUSTOMER = CONCAT(‘CUST|’, CustomerID)
????, HK_CLUB????= CONCAT(‘CLUB|’, ClubID)
????, HK_PRODUCT?= CONCAT(‘PROD|’, ProductID)
????, *
?FROM lake.RoadshowOrder        

You may notice that the Hook Keys for Orders have different KeySets. For the Web Store Orders the business keys are prefixed with “ORD.WEB”; we prefix the Roadshow Orders with “ORD.RS”. Doing this ensures we don’t confuse the different types of orders. If, for example, we tried to perform a join between the Order and RoadshowOrder tables using the HK_ORDER hook key, the query would come up empty, which is precisely the behaviour we want. Whereas, if we tried to join using the OrderID field, there could be overlapping records, and the resulting query would return incorrect results.

Of course, we can do something similar with Data Vault by using the Business Key Collision Code or BKCC (I’ve said it before, it is terrible name!). In fact, I based the idea of HOOK KeySets on the BKCC. However, in Data Vault, we use the BKCC when convenient. In HOOK, KeySets are a mandatory component of the architecture. Every business key MUST be qualified with a KeySet.


Summary

As in the last article, I haven’t explored many options, preferring to stick to patterns I have used before. Does the HOOK approach deliver a simpler model?

I’ll let you decide. Here are the three versions of the Willibald Order model side by side:

No alt text provided for this image

Next Up

In Part 5, we will look at the Willibald Deliveries, which is an interesting part of the model, particularly from a HOOK modelling perspective, because we do have a decision to make on this occasion. Is a Delivery a CBC or not? The resulting HOOK model will be slightly different depending on the answer to that question. I won’t spend too much time looking at DV and ELM alternatives, as I’ve probably flogged that point to death. I will show that the modelling in HOOK is simple regardless of whether we consider Delivery a CBC, or not.

In Part 6, I will complete the picture and show versions of completed models for all three approaches (Data Vault, Ensemble and HOOK).

After that, I will present a simple physical implementation of a HOOK solution for the Willibald case study.?

One thing that I think is odd. The propsed dv solution is probably fine if you ask the dv community. But the question regarding wich links to be created, I only see one solution if you dont want to lose any data regarding the relations between the hubs. Can you guess how many links and what links I would have created?

回复
Rodrigo Bocanegra Cruz

Enterprise & Data Architect

1 年

Hi Andrew, just playing around with GPT and the Mermaid plugin to create a model from the scripts you provided. Very simple and straightforward. There are, of course, simpler ways to do this, like creating the scripts from metadata and getting the model from there, but you might get the idea. This means we can have a proper conceptual model maintained with the business and then complement it with this physical model. I think it is more suitable to compare the models with DataVault this way since the relationships and entities are set even if they are not persisted out of the views, but we are modelling anyway, aren't we? Cheers! https://chat.openai.com/share/c5e2b180-707a-4891-b3e6-f878dfa54953

  • 该图片无替代文字
回复

要查看或添加评论,请登录

社区洞察

其他会员也浏览了