HOOK vs Data Vault: Willibald Part 2

HOOK vs Data Vault: Willibald Part 2

Following on from the introduction to the Willibald case study, this time I will begin to explore the modelling aspects of Data Vault compared to HOOK. I will be stepping through each part of the model in each article.

As a reminder, the diagram below represents the Willibald source database. It consists of only eleven tables.

No alt text provided for this image

I will focus on two specific source tables from the model, the Customer and Club tables. I will show that this seemingly simple data structure is more challenging to model than you might first think.


Customer and Club

As discussed in the last article, I decided that Customer and Club are two of five core business concepts (CBCs) defined for Willibald. We will use these concepts to arrange the data in the Customer and Club source tables.

No alt text provided for this image

Interestingly, these two tables contain foreign key references to the other (I have highlighted the respective fields in the diagram). These kinds of bilateral foreign keys are often difficult to manage in the source system as they are what I would call a “Catch-22” association. Suppose I have a Customer ‘A’ and Club ‘B’ that refer to one another. I cannot insert ‘A’ unless I have already populated ‘B’, and I cannot insert ‘B’ unless I have already inserted ‘A’.?

Fortunately, this isn’t a problem that the data warehouse needs to worry about, but we do need to capture the relationship between the two concepts.

Data Vault

We have two CBCs and two relationships between them. The first and easiest conclusion is that we will need two Hubs, one for each CBC. The second consideration is how to model the Link (or Links) between the two CBCs.

As is often the case, we could model these relationships differently.

Option 1 - Simplest

The first option is probably the simplest. Regardless of which relationship we are looking at, they both involve the association of the Customer and Club CBCs. We can construct a Link for this pairing that connects the corresponding Hubs. We can then hang the contextual data as Satellites from the two source tables from this Link, as shown below:

No alt text provided for this image

It seems simple enough. But maybe not. On closer inspection of the source, the Customer and Club table data will show us that the foreign key relationships are not mandatory. At this point, seasoned Data Vault modellers will start talking about Semantic Grain and Unit of Work (UoW), so let’s delve deeper.

The two relationships between the Customer and Club tables have different meanings. They are both one-to-many relationships but in opposite directions. The Customer to Club foreign-key relationship has a semantic grain of Customer; the Club to Customer foreign-key relationship has a semantic grain of Club.

For example, say the Customer/Club link table contained business keys as follows:

No alt text provided for this image

I can see from the first five records that there are relationships between specific Customers and Clubs, but what isn’t clear is in which direction we read the relationships. The Customer ID is unique within the first seven records, suggesting that the semantic grain is Customer. We appear to have covered the foreign-key relationship from Customer to Club. But what of the foreign key from Club to Customer? Is that relationship represented too??

And how do we interpret the sixth, seventh, eighth and ninth rows? Here we have optional values, suggesting that we read the relationship in one direction for some records but the opposite direction for others. ??

One question I might ask: Can a Customer belong to more than one Club? Or, can a Club have more than one administrative contact? The source system database suggests not, but what if those business rules change? We now have a many-to-many relationship to deal with. The link table can handle this, but the semantic meaning becomes cloudier.

Therefore, if we represent both relationships using the same link table, we lose the meaning of both relationships; in other words, we have semantic ambiguity. Maybe we care about that; perhaps we don’t.

Option 2 – Multiple Links

Let’s assume that we do care. If we want the Data Vault model to represent the relationships from the underlying source model, we might create two different links: one with a semantic grain of Customer; the other with a semantic grain of Club, as shown below.?

No alt text provided for this image

The resulting model is, perhaps, much more explicit regarding its meaning, but we’ve had to create two links rather than the one we had with Option 1.

Let’s swing back and see what the sample link data from Option 1 might look like.?

No alt text provided for this image

Semantically, the picture is clear if we read the relationships from left to right. The data on the left-hand side (LSat Customer Club) shows which Club a Customer is a member of. The data on the right-hand side (LSat Club Customer) shows Customers that are the contacts for each Club.

It might not seem like too much drama, but consider how we might handle relationships between three, four or five different business concepts. The permutations of possible links grow exponentially with each additional business concept; therefore, the chances of making bad design decisions are also significantly increased. ?

Option 3 – Horizontal Partitioning

Another approach is to split the inbound data horizontally.

All records in the Customer table will have a Customer identifier, but the Club identifier may or may not be populated. We can split the data into two mutually exclusive sets: those where the Club identifier is not populated; and those where it is. We can create a hub satellite with the former and hang it from the Customer Hub, then create another link satellite with the latter and hang it from the Customer Club link.

Equally, all records in the Club table will have a Club identifier, but the Customer identifier may or may not be populated. We can split the data in the same way. The overall model will look like this.

No alt text provided for this image

Analysis

Which of these three options should we choose? They are all correct, and we can easily justify each of them. Hand on heart, I have no clue which of them we should use. Furthermore, I’m not sure I’ve explored all the possible options; there could be even more alternatives. For me, that is the problem with Data Vault modelling. Choice introduces anxiety into the process as there will be concerns over the consequences of choosing the wrong option.

Ensemble Modelling (ELM)

But the Data Vault options don’t stop there. If we use the Ensemble Logical Modelling (ELM) approach, the model will look very different, and again there are variations.

Option 1

With ELM, there are no Link Satellites. All contextual information must be attached to a hub.

For the Willibald example, this makes sense, and we can hang the Customer data from the Customer Hub and the Club data from the Club Hub. We record the relationships between the Customer and Club concepts in links between the two hubs, one for each relationship.

No alt text provided for this image

However, we cannot use this model to record relationship changes. For example, say Customer ‘A’ is a member of Club ‘B’ and then later joins Club ‘C’ and leaves Club ‘B’. Assuming the Data Vault implementation is write-only (as it should be), we have no way of end-dating the Customer ‘A’ is a member of Club ‘B’ relationship. To solve this issue, we need a different model.

Option 2

We can end-date a relationship by recording the information in a satellite, but as we already know, we can’t attach satellites to links in the ELM approach. Therefore, we have no option but to create two new hubs, as shown below.

No alt text provided for this image

We still have the two links representing the Customer/Club relationship, but now we have created two new hubs that we attach, so we now have two three-way links. We can directly connect the relationship history to the new Hubs.???

In other words, if we want to record the history of a relationship, then we must create a hub that represents that relationship, from which we can hang a satellite which tracks the history of the relationship.

Analysis

There are pros and cons to the ELM approach. By following the ELM process, we should consistently arrive at the correct model. In this example, the resulting model depends upon whether we want to record the history of the Customer/Club relationships. There is less guesswork when using the ELM approach.

However, as you can see, especially in the second option, we must split the data from the two source tables into ten tables (four hubs, two links and four satellites). I’m sure proponents of ELM and Anchor modelling will shrug their shoulders and say, “Ten tables? So what?”, which is fine, but I prefer something more straightforward.

HOOK

Let’s see what the equivalent HOOK model looks like.

No alt text provided for this image

I probably spent around two seconds figuring out how to model this. There isn’t much to think about. The only consideration is the order in which we want to present the Hook Keys. In the Customer bag, I listed the Customer hook (HK_CUSTOMER) first. In the Club bag, I reversed the order. Does the order matter? Not one bit, but if it helps to clarify the direction of a relationship, then you should settle on a suitable convention that makes it clear. For example, you could qualify the names of the Hook Keys to describe the relationship:

  • In the Customer bag, HK_Club becomes HK_Club__Is_Member_Of, indicating that the Customer is a member of the Club.
  • In the Club bag, HK_Customer becomes HK_Customer__Is_Contact_For, indicating the Club has a contact Customer.

Perhaps you can think of alternative/better naming conventions to help clarify the nature of the underlying relationships (if you feel that is necessary).

But maybe I am trivialising the effort. We can use other options in the HOOK approach to clarify relationships between concepts.

As you probably know, there is no Link construct in HOOK, but there is a mechanism we can use to represent relationships between CBCs. So far, I have shown that Hook Keys align with CBCs, singular CBCs. But we can construct Hook Keys from more than one CBC. We call these compound or composite Hooks, and we can define as many as we choose.

In this example, we could define a Hook made up of the Customer and Club (or Club and Customer if you care about the ordering). When constructing the Hook Key, we can concatenate the Customer and Club business keys into a single value.

So if we defined the Bag as a view over a data lake table, it might look like this.

CREATE VIEW bag.Custome
AS
SELECT HK_Customer ??= CONCAT(‘CUST|’, CustomerID) ?
???? , HK_Club???? ??= CONCAT(‘CLUB|’, ClubID)
???? , HK_Membership = CONCAT(‘CUST|’, CustomerID, ‘~’, ‘CLUB|’, ClubID)
???? , *
? FROM lake.Customerr        

See how I have defined a new CBC called ‘Membership’ that represents the relationship between a Customer and Club. We just concatenated the business keys separating them with the tilde (~) character.

It’s just a matter of preference whether you represent relationships this way. Personally, I think it is overkill. Relationships are generally implied by the data. Unless you believe it is essential to highlight relationships in the Hook model explicitly, then I wouldn’t bother.

Analysis

There isn’t too much to analyse for the HOOK model. If we have two source tables, we will have two Hook Bags unless you decide to partition the data. For example, by splitting vertically for rate-of-change, PII, etc., or horizontally to group records of a similar type together.

Either way, I hope I have shown how much simpler a Hook model is if you compare it to an equivalent Data Vault model.


Next Up

My next article will focus on the Product and Product Category tables. There are some intriguing aspects to consider, especially for Product Category. For example, is it reference data? And how do I represent hierarchies? Until then…

Remco Broekmans

VP International Programs at Genesee Academy, LLC

1 年

Nice write up and of course for me only the ELM approach counts ;-). What my issue is with only one linke between the two or as you domin Hook no link at all you always need to look at the data and/or tables to understand what relationships there are. Of course in ELM it is direct visible if you look at the model, easier to see and easier to explain. I do miss your little matrix here which you used before to show what relationships there are possible, imo that solves part of the “wanna know my relationships? Look at the data!” idea. Looking forward to your next post. Ps end of this week I will post he remainder ELM templates plus the logical Data Vault model.

I have another alternative to your dv model options: Hub customer, hub club, link hubcustomer, sat customer(hub customer), sat club(hub club) rts club(link hubcustomer), rts customer(link customer). This option can be loaded without any prelogic, with zero data loss and no dependencies between the loading processes of club and customer. The only concern is that link clubcustomer can be loaded from two processes at the same time and propper locking strategy has to be in place to avoid the risk of key violation constraint.

要查看或添加评论,请登录

社区洞察

其他会员也浏览了