Always try to boil the ocean!

Recently, I have felt very left out by all these Snowflake folk with their superhero avatars. As I don’t work for Snowflake, I felt I had no other option but to make my own. The best I could come up with (using the emoji designer on my iPad) was this comical-looking green ninja, who I’m sure will have some controversial opinions to share with you in the future. I also considered adding “Data Ninja” to my LinkedIn tagline, but honestly, that seems so pretentious. So, for all you Data Artisans, Charmers, Gurus, Liberators, Magicians, Philosophers, Pioneers, Warriors and Whisperers (you know who you are): You are better than that, so please stop.

OK, now that I’ve alienated half of my audience, it’s time to upset the rest...


I’ve attended my fair share of Data Vault webinars and contributed to Data Vault forums over the years, and the message I hear more than any other is:

“Never try to boil the ocean!!”

Roughly translated, this golden nugget of advice is telling us not to try and build the whole data warehouse in one go but rather to develop it incrementally in tiny pieces. I think they call it being agile. I often hear the advice, particularly from the experts at the Data Vault Alliance, that in each iteration, you should take only a handful of tables, build one or two hubs and a link if necessary, and look to iterate in short two-day sprints. Wow. A two-day sprint to deliver maybe half a dozen tables: A truly magnificent achievement!

Let me just check my backlog. It seems that for this particular source, I have six hundred tables to ingest and by my rough calculations, that will take the best part of two hundred working days. Oh, and I have these other twenty systems I need to do the same for. Yes, we are agile, but we aren’t going to get anywhere close to chewing through that backlog any time soon.

But I have to ask, what is the end goal? Because the way I see it, we actually DO want to boil the ocean, and we need a method that allows us to do that as quickly as possible. One cup at a time isn't going to cut it; we need to scale up to industrial-sized buckets and load as much data as possible.

OK, with that in mind, let’s see how the Data Ninja gets on discussing this thorny issue with a Data Vault "expert":

I’m beginning to like this Data Ninja fellow; he makes a lot of sense.


If storage is cheap (as we are often told), then surely we should try to store as much data as we possibly can at the earliest opportunity, because we only get one chance to do so. With every day, week, month or year of delay, we are effectively throwing away data that might have been extremely valuable to the business.

But it seems that if we want to be agile and build out the warehouse in small increments, then the modelling activities can be a bottleneck. How can we remove this bottleneck?

What if you didn’t need to model the data upfront? What if you could apply a model after the data has been ingested? What if you could change that model on a whim without reloading any data? Wouldn’t that be something?

With HOOK, that is precisely what you can do. With a few simple tools, which you can (as I have) build yourself, and some metadata from the source system, it is easy to tag all the source tables and generate the necessary data artefacts in a matter of minutes. Two days for six tables? What about two hours for six hundred tables? Very, very doable.


To give you an idea, let’s see how easy this could be. To save a bit of effort, I will repurpose the example I used in my last article on the Unified Star Schema. If you recall, there were four tables from the HR system: Person, Employee, Position and Employment History, as shown in Figure 1.

Figure 1

Let’s assume that we have access to the metadata for these tables, and we can access it in tabular format as shown in Table 1.

Table 1

There is a row in the table for each column within each of the four source system tables, recording:

  • The name of the table
  • The name of the column
  • The data type of the column
  • Whether the column is part of the table’s primary key
  • Whether the column references the primary key of another table (a foreign key)

There might be other column descriptors, such as the column’s ordinal position within its table, but I’ve left those out as I don’t need them. I haven’t listed all the columns, so assume that we would have a complete list of columns for all the tables. I have first-hand experience with this particular system, which contains 479 tables in its database with a total of 12,879 columns. That’s quite a lot of metadata.
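
If the source sits in a relational database, most of this metadata can be lifted straight out of the catalog views rather than typed in by hand. The sketch below is only an illustration: it assumes a SQL Server-style source and a hypothetical meta.Source_Column table, and it simplifies key detection (foreign keys would need a similar join through the referential constraint views).

-- Sketch only: harvest column metadata into a hypothetical meta.Source_Column table.
SELECT Table_Name     = c.TABLE_NAME
     , Column_Name    = c.COLUMN_NAME
     , Data_Type      = c.DATA_TYPE
     , Is_Primary_Key = CASE WHEN pk.COLUMN_NAME IS NOT NULL THEN 'Y' ELSE 'N' END
  INTO meta.Source_Column
  FROM INFORMATION_SCHEMA.COLUMNS AS c
  LEFT JOIN (SELECT kcu.TABLE_NAME, kcu.COLUMN_NAME
               FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS kcu
               JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS tc
                 ON tc.CONSTRAINT_NAME = kcu.CONSTRAINT_NAME
              WHERE tc.CONSTRAINT_TYPE = 'PRIMARY KEY') AS pk
    ON pk.TABLE_NAME  = c.TABLE_NAME
   AND pk.COLUMN_NAME = c.COLUMN_NAME
 WHERE c.TABLE_NAME IN ('Person', 'Employee', 'Position', 'Employment_History')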

How do we Hookify this data?

That’s the second time I’ve used the word “hookify” in as many articles. I should probably define it.

To hookify each source table, we need to formally expose its business keys. In preparation for that, let’s add a couple more columns to the metadata table, as shown in Table 2.

Table 2

I’ve added two more columns (shaded in green). The first is to record the Core Business Concept (CBC) for a business key, and the second is the KeySet identifier, which we will use to qualify the business key. As discussed in the last article, there are three CBCs that we care about: Person, Employee and Position.
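
In database terms, assuming the hypothetical meta.Source_Column table from the earlier sketch, those two extra columns are nothing more exotic than a pair of nullable varchar columns:

-- Sketch only: extend the hypothetical metadata table with the two tagging columns.
ALTER TABLE meta.Source_Column ADD Core_Business_Concept varchar(50) NULL;
ALTER TABLE meta.Source_Column ADD KeySet                varchar(50) NULL;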

We can now inspect our metadata and see where we might have some business keys. If we consider business keys for Person, then we can see that any columns named Person_Id look like they might be business keys. If that is true, and we trust that the source system uses a consistent naming convention throughout its database, then we should be able to pick out all Person business keys, as shown in Table 3.

Table 3

For each column named Person_Id, I have assigned the CBC of Person and a KeySet identifier of {HR.PER}. We can do the same thing for the Employee and Position CBCs, identifying Employee_No and Position_Id as business keys, as shown in Table 4.

Table 4
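
If the naming convention really is as consistent as we hope, the tagging itself is just a handful of set-based updates against the metadata table. A minimal sketch, again using the hypothetical meta.Source_Column table:

-- Sketch only: tag business key columns by naming convention.
UPDATE meta.Source_Column SET Core_Business_Concept = 'Person',   KeySet = 'HR.PER' WHERE Column_Name = 'Person_Id';
UPDATE meta.Source_Column SET Core_Business_Concept = 'Employee', KeySet = 'HR.EMP' WHERE Column_Name = 'Employee_No';
UPDATE meta.Source_Column SET Core_Business_Concept = 'Position', KeySet = 'HR.POS' WHERE Column_Name = 'Position_Id';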

You can see now that we can define multiple business keys for each source table. Using this metadata and the tagging information for the business keys, we can easily generate SQL data definition language (DDL) for the HOOK bags, as follows.

CREATE VIEW bag.Person
AS
SELECT HK_Person   = 'HR.PER|' + Person_Id    -- hook key: KeySet qualifier + business key
     , *
  FROM lake.Person
-----------------------------------------------------
CREATE VIEW bag.Employee
AS
SELECT HK_Employee = 'HR.EMP|' + Employee_No
     , HK_Person   = 'HR.PER|' + Person_Id
     , *
  FROM lake.Employee
-----------------------------------------------------
CREATE VIEW bag.Position
AS
SELECT HK_Position = 'HR.POS|' + Position_Id
     , *
  FROM lake.Position
-----------------------------------------------------
CREATE VIEW bag.Employment
AS
SELECT HK_Employee = 'HR.EMP|' + Employee_No
     , HK_Position = 'HR.POS|' + Position_Id
     , *
  FROM lake.Employment_History

Of course, this is a simple example. If the naming conventions aren’t consistent, you might have to look for alternative names. For example, some of the employee fields might be called emp_id or included in longer column names, such as primary_empid, requiring fuzzy match searches to find the business key columns.
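
That first pass can also be run against the metadata itself. The sketch below (same hypothetical meta.Source_Column table as before) simply lists candidate Employee key columns for a human to review rather than tagging them automatically:

-- Sketch only: flag likely Employee business key columns for manual review.
SELECT Table_Name, Column_Name
  FROM meta.Source_Column
 WHERE Core_Business_Concept IS NULL
   AND ( Column_Name LIKE '%emp%id%'        -- catches emp_id, primary_empid, etc.
      OR Column_Name LIKE '%employee%no%' )
 ORDER BY Table_Name, Column_Name;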

But even then, it is still possible to miss some business keys. There will always be edge cases that don’t fit the expected patterns. That is the beauty and power of the HOOK approach: if we discover more business keys later, we can quickly drop and recreate the necessary data warehouse artefacts without touching the underlying raw data.
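
To make the “drop and recreate” point concrete, here is a rough sketch of how the bag view DDL above could be regenerated from the tagged metadata. It assumes the hypothetical meta.Source_Column table from the earlier sketches and a STRING_AGG-capable dialect; a real generator would also deal with view naming, column ordering and data types.

-- Sketch only: build one CREATE VIEW statement per tagged source table.
SELECT Table_Name
     , Ddl = 'CREATE VIEW bag.' + Table_Name + ' AS SELECT '
           + STRING_AGG('HK_' + Core_Business_Concept + ' = ''' + KeySet + '|'' + ' + Column_Name, ', ')
           + ', * FROM lake.' + Table_Name
  FROM meta.Source_Column
 WHERE Core_Business_Concept IS NOT NULL
 GROUP BY Table_Name;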


As you have seen, creating a HOOK data warehouse using simple metadata tagging and code generation techniques is a relatively simple matter. The tools are easy to build and allow us to create large numbers of data warehouse artefacts in bulk. So, in data warehousing terms, it is possible to boil the ocean.

If you still don’t believe me, then the Data Ninja might have something to say about that…




Nick Pinfold

Principal Data Analyst at Wellington City Council

1y

Great article. A data dictionary/metadata is the best way to acquire data in bulk and at scale, and the same metadata can usually be used to automate the testing and release process. I have used the dictionary-to-metadata approach many times, and because you have proven the method, the code and data are better than hand coding. The only issue is an untrusting release process that won’t let it into production because it’s untested by users, even if automated testing has been completed. The question I have is: how do you convince management to release 600 tables in one go, or in agreed sub-batches? This is a template approach to acquisition which has now been commercialized as an extract and load tool. I’m seeing the approach of: extract and load via one tool, translate via another tool, and model a simplified or performant “consumer” end-user layer.

Ronald Kunenborg

Enterprise Data Architect, data expert, coach and data modeller

1y

The whole problem with this article is that it does try to boil the ocean and, in doing so, violates a number of best practices for no gain that I can see. The number of anti-patterns to Agile working practices is already large. No need to add new ones to the existing collection.

Martijn Evers

Enterprise Knowledge Facilitator/Philosopher, Full Scale (Holistic) Knowledge and data Architect

1y

The issue is more problematic than you think. Both HOOK and Data Vault can be used to create an enhanced version of a persistent historical data stage. With model automation, the arguments between these two are totally moot. Without tools, I think I would use HOOK or a similar, simpler PSA approach. Data Vault can be used in lots of other ways as well (both a strength and a weakness). HOOK is optimized for a PSA role.

Patrik Lager

Chief Data Architect at Daana

1y

The comparison, I think, is that you are creating a Raw Vault (Hub/Link) according to business concepts, with attributes according to the source system. But what about the Business Vault? In my opinion, the Raw Vault is glorified Persistent Staging and totally unnecessary; the staging can be persistent according to source (aka a Data Lake), but the true value as an analytical storage system comes when you create a fully integrated, business-oriented layer, which is semantically and structurally source-system agnostic in its data representation. You know that the data lake people had the same approach as you, but 10-15 years ago: drop as many tables into the lake as fast as possible, then do schema on read. How did that work out?
