HOOK vs Data Vault: Reference Data
How do we model Currency in the data warehouse? Is it reference data or a core business concept (CBC)? The answer to that question may depend upon who you put the question to. If you are a local flower shop, you can safely treat Currency as reference data. If you are an international bank, then Currency will have greater significance to the day-to-day running of your business. However, the distinction might not always be so obvious when we get around to modelling the data.
In this article, I will work through an example and show that with Data Vault, the way we model Currency can be confusing. I will also show how we might approach the same problem in HOOK. Although this discussion focuses on Currency, the same reasoning applies to other business concepts that walk the fine line between CBC and reference data.
The Scenario
The following model represents a snippet from a more extensive financial model with Currency at the centre.?
There are four tables:
Data Vault Solution
Looking at the source data model, we can see that from the perspective of Customer and Account tables, Currency looks very much like reference data. So I will model it that way and see how it looks. I’m also going to assume that Customer and Account are core business concepts (CBCs), and as such, we will have Customer and Account hubs.
?I’ve chosen to model Customer as a Hub/Satellite pair. However, I modelled Account as a Link Satellite hanging off the Account Link, which joins the Account and Customer Hubs.
Both Satellites refer to the Currency reference table. Notice the dotted lines indicating the relationships are optional; therefore, we do not set any referential integrity constraints, which is a general rule for Data Vault for reference tables.
So far, so good. Everything seems to make sense, and we’ve had no issues or difficult modelling decisions. Let’s now model the Exchange Rate.
We immediately hit a problem. I have used all the information from the source Exchange Rate table, but we don’t have anything left that we can use to hang off a Hub or Link. There are no recognised business keys in this data source.
How do we solve this problem? I can think of three possible solutions, none of which is ideal.
Option 1 - More Reference Data
Why not treat the Exchange Rate information as reference data? Something like this.
Is this really what we want to do? We now have reference data referencing more reference data; we are effectively snowflaking the reference data. It feels like a bad idea. Where do we stop with this approach? If we go down this path, we might as well abandon Data Vault altogether and aim for a normalised Inmon-style model.
Option 2 – Make it a CBC
Going in the opposite direction, why don’t we adopt Currency as a CBC and create a Hub for it??
That seems to work well. We have the Currency details stored in its own Satellite and the Exchange Rate information hanging from a Link that pairs together two Currencies.
But aren’t we forgetting something? What about the model we built for Account and Customer? If we refactor the Currency reference table as a regular Hub/Satellite combo, we will also have to change all other warehouse artefacts that referred to it before.
In our example, we would have to add Currency to the Account Link or create a new Account Link which includes Currency. We would also have to create a new Customer link that associates Customer and Currency and hang the Customer Satellite from it.
Suddenly, the model is starting to look a bit messy. If you think about this a bit more, you might consider how the model might evolve as we add more reference data. What if we had a source table that contained twenty different reference fields (not unheard of)? If we modelled them all as CBCs with a Hub for each, we’d need to build a link table containing more than twenty Hub references. The model is now getting to a point where it is beyond our control.
Option 3 – Why not both?
The third option is to represent Currency in the model as reference data and as a CBC. For the Exchange Rate data, we treat Currency as a CBC. For the Account and Customer data, we treat Currency as reference data.
If you pushed for an answer from the Data Vault Alliance, I would guess they would go with this option. But my primitive data modeller’s brain is screaming at me, telling me this is an accident waiting to happen. In effect, we are storing the same data twice, and we have no guarantees that the two versions of the data will be the same. ??
Three options, none of them ideal. Are there any other that I haven’t considered?
HOOK Solution
With HOOK, things are a little more straightforward. As before, we analyse each source and pick out the business keys or the keys that identify CBCs. The question, therefore, is whether Currency is a CBC or not. If we say it is, we will most likely end up with hundreds and hundreds of CBCs because reference data is ubiquitous; it is quite literally everywhere. We want to keep the number of CBCs down to a manageable size.
I discuss this in more detail when I explain the significance of Dunbar’s Number in my substack article on the Business Glossary.
HOOK uses a different approach to exposing key values for reference data that still looks and feels like a regular CBC Hook Key. We create a catch-all CBC called REF that we can use for all reference data types. REF is too broad and generic for us to consider it a CBC, so we flag it as a Weak CBC. Then to make it even more obvious, we can use a different Hook Key qualifier to distinguish them from regular CBC Hook Keys.
We will use the prefix “WK_” for Weak CBCs in this example. Regular CBC-based Hook Key will be prefixed, as before, with “HK_”. Our HOOK model will look like this:
Weak or otherwise, we construct the Hook Keys in the same way, and we must attach a KeySet. As the Currency in this example is from an ISO standard, I would probably assign it a KeySet {ISO4217}. If we chose to build the Exchange Rate Bag as a view, the SQL DDL would look something like this:
CREATE VIEW [BAG Exchange Rate
AS
SELECT WK_Ref__Currency__From = CONCAT(‘ISO4217|’, From_Currency_Code)
????, WK_Ref__Currency__To??= CONCAT(‘ISO4217|’, To_Currency_Code)
????, *
?FROM [Exchange Rate]]
As in last week’s article (HOOK vs Data Vault: Modelling) we can summarise the relationships using a Kimball style BUS matrix.
Analysis
Reference Data in Data Vault has always (to me) felt like it was an afterthought and is not part of the core Data Vault architecture. And it shows. There are situations when the same data looks like reference data and simultaneously like a legitimate business concept. Depending upon how you treat it, it has a profound impact on the target data model.
On the other hand, HOOK treats reference data like any other source of information. The only difference is that we group all reference data under a common pseudo-CBC called REF. We always need to ensure the number of CBCs remains a manageable number.
Conclusions
What conclusions can we draw? There are cases where data looks like reference data, some cases where it doesn’t and others where it could be either. For Data Vault this is problem and can lead to wildy different models and, therefore, the need to make tough design decisions.?
And with HOOK? By introducing the concept of Weak CBCs we can accomodate reference data in the same way we would with any other type of data. There are no tricky design decisions here. The only thing we have to worry about is what are the core business concepts and which of them can be categorised as reference type data.?