HOOK vs Data Vault: Reference Data

HOOK vs Data Vault: Reference Data

How do we model Currency in the data warehouse? Is it reference data or a core business concept (CBC)? The answer to that question may depend upon who you put the question to. If you are a local flower shop, you can safely treat Currency as reference data. If you are an international bank, then Currency will have greater significance to the day-to-day running of your business. However, the distinction might not always be so obvious when we get around to modelling the data.

In this article, I will work through an example and show that with Data Vault, the way we model Currency can be confusing. I will also show how we might approach the same problem in HOOK. Although this discussion focuses on Currency, the same reasoning applies to other business concepts that walk the fine line between CBC and reference data.


The Scenario

The following model represents a snippet from a more extensive financial model with Currency at the centre.?

No alt text provided for this image

There are four tables:

  • Customer – that contains details of customers
  • Account – details of financial accounts held by customers
  • Currency – A list of standard ISO-4217 currencies
  • Exchange Rate – Records the fluctuations in exchange rates between pairs of currencies.


Data Vault Solution

Looking at the source data model, we can see that from the perspective of Customer and Account tables, Currency looks very much like reference data. So I will model it that way and see how it looks. I’m also going to assume that Customer and Account are core business concepts (CBCs), and as such, we will have Customer and Account hubs.

No alt text provided for this image

?I’ve chosen to model Customer as a Hub/Satellite pair. However, I modelled Account as a Link Satellite hanging off the Account Link, which joins the Account and Customer Hubs.

Both Satellites refer to the Currency reference table. Notice the dotted lines indicating the relationships are optional; therefore, we do not set any referential integrity constraints, which is a general rule for Data Vault for reference tables.

So far, so good. Everything seems to make sense, and we’ve had no issues or difficult modelling decisions. Let’s now model the Exchange Rate.

No alt text provided for this image

We immediately hit a problem. I have used all the information from the source Exchange Rate table, but we don’t have anything left that we can use to hang off a Hub or Link. There are no recognised business keys in this data source.

How do we solve this problem? I can think of three possible solutions, none of which is ideal.

Option 1 - More Reference Data

Why not treat the Exchange Rate information as reference data? Something like this.

No alt text provided for this image

Is this really what we want to do? We now have reference data referencing more reference data; we are effectively snowflaking the reference data. It feels like a bad idea. Where do we stop with this approach? If we go down this path, we might as well abandon Data Vault altogether and aim for a normalised Inmon-style model.

Option 2 – Make it a CBC

Going in the opposite direction, why don’t we adopt Currency as a CBC and create a Hub for it??

No alt text provided for this image

That seems to work well. We have the Currency details stored in its own Satellite and the Exchange Rate information hanging from a Link that pairs together two Currencies.

But aren’t we forgetting something? What about the model we built for Account and Customer? If we refactor the Currency reference table as a regular Hub/Satellite combo, we will also have to change all other warehouse artefacts that referred to it before.

In our example, we would have to add Currency to the Account Link or create a new Account Link which includes Currency. We would also have to create a new Customer link that associates Customer and Currency and hang the Customer Satellite from it.

No alt text provided for this image

Suddenly, the model is starting to look a bit messy. If you think about this a bit more, you might consider how the model might evolve as we add more reference data. What if we had a source table that contained twenty different reference fields (not unheard of)? If we modelled them all as CBCs with a Hub for each, we’d need to build a link table containing more than twenty Hub references. The model is now getting to a point where it is beyond our control.

Option 3 – Why not both?

The third option is to represent Currency in the model as reference data and as a CBC. For the Exchange Rate data, we treat Currency as a CBC. For the Account and Customer data, we treat Currency as reference data.

If you pushed for an answer from the Data Vault Alliance, I would guess they would go with this option. But my primitive data modeller’s brain is screaming at me, telling me this is an accident waiting to happen. In effect, we are storing the same data twice, and we have no guarantees that the two versions of the data will be the same. ??


Three options, none of them ideal. Are there any other that I haven’t considered?


HOOK Solution

With HOOK, things are a little more straightforward. As before, we analyse each source and pick out the business keys or the keys that identify CBCs. The question, therefore, is whether Currency is a CBC or not. If we say it is, we will most likely end up with hundreds and hundreds of CBCs because reference data is ubiquitous; it is quite literally everywhere. We want to keep the number of CBCs down to a manageable size.

I discuss this in more detail when I explain the significance of Dunbar’s Number in my substack article on the Business Glossary.

HOOK uses a different approach to exposing key values for reference data that still looks and feels like a regular CBC Hook Key. We create a catch-all CBC called REF that we can use for all reference data types. REF is too broad and generic for us to consider it a CBC, so we flag it as a Weak CBC. Then to make it even more obvious, we can use a different Hook Key qualifier to distinguish them from regular CBC Hook Keys.

We will use the prefix “WK_” for Weak CBCs in this example. Regular CBC-based Hook Key will be prefixed, as before, with “HK_”. Our HOOK model will look like this:

No alt text provided for this image

Weak or otherwise, we construct the Hook Keys in the same way, and we must attach a KeySet. As the Currency in this example is from an ISO standard, I would probably assign it a KeySet {ISO4217}. If we chose to build the Exchange Rate Bag as a view, the SQL DDL would look something like this:


CREATE VIEW [BAG Exchange Rate
AS 
SELECT WK_Ref__Currency__From = CONCAT(‘ISO4217|’, From_Currency_Code)
????, WK_Ref__Currency__To??= CONCAT(‘ISO4217|’, To_Currency_Code)
????, *
?FROM [Exchange Rate]]        

As in last week’s article (HOOK vs Data Vault: Modelling) we can summarise the relationships using a Kimball style BUS matrix.

No alt text provided for this image

Analysis

Reference Data in Data Vault has always (to me) felt like it was an afterthought and is not part of the core Data Vault architecture. And it shows. There are situations when the same data looks like reference data and simultaneously like a legitimate business concept. Depending upon how you treat it, it has a profound impact on the target data model.

On the other hand, HOOK treats reference data like any other source of information. The only difference is that we group all reference data under a common pseudo-CBC called REF. We always need to ensure the number of CBCs remains a manageable number.


Conclusions

What conclusions can we draw? There are cases where data looks like reference data, some cases where it doesn’t and others where it could be either. For Data Vault this is problem and can lead to wildy different models and, therefore, the need to make tough design decisions.?

And with HOOK? By introducing the concept of Weak CBCs we can accomodate reference data in the same way we would with any other type of data. There are no tricky design decisions here. The only thing we have to worry about is what are the core business concepts and which of them can be categorised as reference type data.?

要查看或添加评论,请登录

Andrew Foad的更多文章

  • Is Data Modelling Dead?

    Is Data Modelling Dead?

    I’ve been using data modelling techniques for most of my 35-year career, and I consider it an essential skill for…

    38 条评论
  • Data Vault is Garbage (change my mind)

    Data Vault is Garbage (change my mind)

    If you are unfamiliar with the Hans Christian Andersen story of “The Emperor’s New Clothes”, ChatGPT summarises it…

    7 条评论
  • Introducing HOOK. Part 1 - Logical Architecture

    Introducing HOOK. Part 1 - Logical Architecture

    This article is the first in a short series on the Hook approach to Data Warehousing. As I said in my introduction…

  • Is “Agile Data” an Oxymoron?

    Is “Agile Data” an Oxymoron?

    In last week’s article, I discussed how you can “boil the ocean” when building a HOOK data warehouse. The article…

    11 条评论
  • Always try to boil the ocean!

    Always try to boil the ocean!

    Recently, I have felt very left out by all these Snowflake folk with their superhero avatars. As I don’t work for…

    16 条评论
  • HOOK and the Unified Star Schema

    HOOK and the Unified Star Schema

    First, we should probably understand what a Unified Start Schema is before we start talking about how we can use it in…

    6 条评论
  • HOOK vs Data Vault: Willibald Part 6

    HOOK vs Data Vault: Willibald Part 6

    As a reminder, the diagram below represents the Willibald source database. It consists of eleven tables.

  • HOOK vs Data Vault: Willibald Part 5

    HOOK vs Data Vault: Willibald Part 5

    As a reminder, the diagram below represents the Willibald source database. It consists of eleven tables.

    10 条评论
  • The Attempted Assassination of ELM

    The Attempted Assassination of ELM

    Around the middle of the 19th Century, the students of a public boy’s school in the English Midlands invented a game. A…

    13 条评论
  • HOOK vs Data Vault: Willibald Part 4

    HOOK vs Data Vault: Willibald Part 4

    As a reminder, the diagram below represents the Willibald source database. It consists of eleven tables.

    5 条评论

社区洞察