Data Vault Modelling - Part 2

This is a sequel to my earlier post where I introduced the concept of data vault modelling. As I learn more about it, I want to share it with a wider audience and leverage the collective intelligence of the community. I do not claim to be an expert on this topic yet, but I am trying to become one by practicing it. My interest has grown manifold because, theoretically, I see real value in this technique; I now need to validate that value by actually practicing it.

In this article, we will use the example of a fictitious utility company to go a little deeper into the concepts. The use case is as below:

A utility company (ABC Electric and Gas) has assets deployed across the state for power transmission to domestic and commercial establishments. An example of an asset is a transformer, or a camera that takes pictures of the devices and transmits them at regular intervals to verify that a device is in good condition. These assets need regular inspection, and the inspection details need to be logged for auditing purposes. So, clearly, we have two business objects here: ASSET and INSPECTION. We will continue our vault modelling with these two objects.

What attributes does the ASSET object capture?

  1. Asset_Id - A unique ID that identifies the asset
  2. Asset_Type - What is the type of the asset? Is it a camera, a transformer, or something else?
  3. Manufacturer - Who is the manufacturer of the device?
  4. Model - What is the model of the device?

There are many other attributes, but these are sufficient for understanding the concept; they are also generalized so as not to share any confidential information.

What attributes does the INSPECTION object capture?

  1. Inspection_Start_Date - When did the asset inspection start?
  2. Inspection_End_Date - When did the asset inspection end?
  3. Inspection_Performed_By - Who performed the inspection?

Again, there are many more attributes, but these are enough for a conceptual understanding.

Let us recall the core components of a Data Vault:

  1. HUB
  2. SATELLITE
  3. LINK

HUBs are the first components you model in the data vault technique. Think of a HUB as the component that helps you integrate the many data sources (systems of record) in the data warehouse. A key thing to understand here is that, for the HUB to integrate the source systems, its content must be source-system agnostic. HUBs should therefore carry keys that are meaningful and understood by all: the business keys, which are not tied to any source system and have a common understanding across the enterprise.

HUBs, therefore, are the store of all the business keys, the keys that the BUSINESS uses to search for and retrieve data. In our use case, we need two HUBs: one for ASSET and one for INSPECTION. Now comes the next complexity: under ASSET, we capture many types of assets (CAMERA, TRANSFORMER, etc.). Do we create a single HUB for assets, or one HUB per asset type? According to the book "Building a Scalable Data Warehouse with Data Vault 2.0" by Daniel Linstedt and Michael Olschimke, hubs are defined using a unique list of business keys which are supposed to have the same semantic meaning. Hence the recommendation is to have one HUB per asset type. So, in this case, I created a CAMERA_HUB as below.

CAMERA_HUB

  1. HUB_CAMERA_ID_KEY - A hash of the two business keys below
  2. ASSET_ID - BUSINESS KEY 1
  3. ASSET_TYPE - BUSINESS KEY 2
  4. LOAD_DATE_TIMESTAMP - This column tells us when the EDW system knew about this business key for the first time
  5. RECORD_SOURCE - This column tells us which source system populated this row. In case of multiple sources, this will have the first source which populated the row

The primary key of the HUB is usually calculated by taking a hash of the business key(s). The business key(s) must be declared unique, which means there will be only one row per business key (whether a single column or a compound key).
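As an illustrative sketch of this hash-key calculation: the exact hash function, delimiter, and normalization rules are implementation choices I am assuming here (MD5 with an upper-cased, `||`-delimited concatenation is a common convention in Data Vault 2.0 implementations, not something stated in this article), but the idea is that the same business key always produces the same surrogate key:

```python
import hashlib

def hub_hash_key(*business_keys: str) -> str:
    """Derive a hub hash key: normalize each business key part
    (trim whitespace, upper-case), join with a delimiter, and hash."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# HUB_CAMERA_ID_KEY is derived from the two business keys
# ASSET_ID and ASSET_TYPE (the values here are made up):
key = hub_hash_key("AST-1001", "Camera")
```

Because the input is normalized before hashing, the same asset arriving from different source systems with minor formatting differences still maps to the same hub row.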

Next, I created the INSPECTION hub as below

INSPECTION_HUB

  1. INSPECTION_LOG_ID_HASH - This is the hash of the business key
  2. INSPECTION_LOG_ID - This is the business key
  3. LOAD_DATE_TIMESTAMP - This column tells us when the EDW system knew about this business key for the first time
  4. RECORD_SOURCE - This column tells us which source system populated this row. In case of multiple sources, this will have the first source which populated the row

Next is the LINK table. I now need to link the two HUBs, CAMERA and INSPECTION. The LINK is the main component that makes this model flexible and extensible: it allows us to add new objects or relationships in the future without changing anything we have already developed. This is what gives the data warehouse its incremental development ability. In this particular case, I created the LINK table as below.

LINK-CAMERA

  1. LNK_CAMERA_LOG_HASH_KEY - This will be a hash of #2 and #3
  2. HUB_CAMERA_ID_KEY - Primary key from CAMERA HUB, this will be a foreign key here
  3. INSPECTION_LOG_ID_HASH - Primary key from INSPECTION HUB, this will be a foreign key here
  4. ASSET_ID - Business key of CAMERA HUB
  5. ASSET_TYPE - Business key of CAMERA HUB
  6. INSPECTION_LOG_ID - Business key of INSPECTION HUB
  7. LOAD_DATE_TIMESTAMP - This column tells us when the EDW system knew about this business key for the first time
  8. RECORD_SOURCE - This column tells us which source system populated this row. In case of multiple sources, this will have the first source which populated the row

So, the LINK table contains the hash key (a hash of the primary keys from the HUB tables), the primary keys from the HUB tables, the business keys of the HUBs, and the additional metadata columns. A LINK must connect at least two HUBs. A LINK table is similar to a FACT table in dimensional modelling, as its granularity is defined by the number of parent keys it has.
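To make the column list above concrete, here is a sketch of assembling one LINK-CAMERA row. Following the definition above, the link hash key is computed from the two hub hash keys (#2 and #3). The function and field values are hypothetical illustrations, not the author's actual load code:

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*parts: str) -> str:
    """Same normalization-and-hash convention used for the hub keys."""
    normalized = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def build_link_row(asset_id: str, asset_type: str,
                   inspection_log_id: str, record_source: str) -> dict:
    """Assemble one LINK-CAMERA row relating a camera to an inspection."""
    hub_camera_key = hash_key(asset_id, asset_type)   # from CAMERA_HUB
    insp_hash = hash_key(inspection_log_id)           # from INSPECTION_HUB
    return {
        "LNK_CAMERA_LOG_HASH_KEY": hash_key(hub_camera_key, insp_hash),
        "HUB_CAMERA_ID_KEY": hub_camera_key,
        "INSPECTION_LOG_ID_HASH": insp_hash,
        "ASSET_ID": asset_id,
        "ASSET_TYPE": asset_type,
        "INSPECTION_LOG_ID": inspection_log_id,
        "LOAD_DATE_TIMESTAMP": datetime.now(timezone.utc),
        "RECORD_SOURCE": record_source,
    }
```

Note that the link row itself carries no descriptive attributes; it only records that a relationship between the two hubs was observed, which is what keeps the model extensible.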

And now the last component: the SATELLITE. Satellites are connected to either a LINK table or a HUB table and contain the descriptive data of that HUB or LINK. This is where change and the history of the data are stored. The schema of a Satellite is like an SCD Type 2 (Slowly Changing Dimension): if any of the descriptive information changes, the row is not updated in place; instead, the current row is closed and a new row is inserted. The satellite tables are therefore modelled as below.

SATELLITE-CAMERA

  1. HUB_CAMERA_ID_KEY - Primary key of the HUB
  2. LOAD_DATE_TIMESTAMP - The day when this row was loaded
  3. LOAD_END_DATE_TIMESTAMP - The day the row was changed and closed
  4. MANUFACTURER - The descriptive column of the camera
  5. MODEL - The descriptive data of the camera

SATELLITE-INSPECTION

  1. INSPECTION_LOG_ID_HASH - Primary key of the HUB
  2. LOAD_DATE_TIMESTAMP - The day when this row was loaded
  3. LOAD_END_DATE_TIMESTAMP - The day the row was changed and closed
  4. INSPECTION_START_DATE - Descriptive column of the HUB
  5. INSPECTION_END_DATE - Descriptive column of the HUB
  6. INSPECTION_PERFORMED_BY - Descriptive column of the HUB

One more column that I have not included in the satellite tables above is the HASH-DIFF key. To be DV 2.0 compliant, you must calculate this column by taking a hash of all the descriptive columns of a row (not the metadata columns) and store it in the table. This column tells us whether the row has changed since it was previously loaded into the data vault.
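The hash-diff and the close-and-insert behaviour described above can be sketched together. This is a minimal in-memory illustration under my own assumptions (MD5 hashing, sorted column names, rows as dicts); a real load would do the same comparison in SQL or an ETL tool:

```python
import hashlib
from datetime import datetime, timezone

def hash_diff(descriptive: dict) -> str:
    """Hash all descriptive columns, sorted by column name so the
    result is deterministic. Metadata columns are excluded."""
    payload = "||".join(f"{k}={str(v).strip().upper()}"
                        for k, v in sorted(descriptive.items()))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def load_satellite_row(rows: list, hub_key: str, descriptive: dict) -> list:
    """SCD Type 2 behaviour: if the hash-diff is unchanged, do nothing;
    otherwise close the current open row and insert a new version."""
    new_diff = hash_diff(descriptive)
    open_rows = [r for r in rows
                 if r["HUB_CAMERA_ID_KEY"] == hub_key
                 and r["LOAD_END_DATE_TIMESTAMP"] is None]
    if open_rows and open_rows[0]["HASH_DIFF"] == new_diff:
        return rows                          # no change detected
    now = datetime.now(timezone.utc)
    for r in open_rows:                      # close the superseded version
        r["LOAD_END_DATE_TIMESTAMP"] = now
    rows.append({"HUB_CAMERA_ID_KEY": hub_key,
                 "LOAD_DATE_TIMESTAMP": now,
                 "LOAD_END_DATE_TIMESTAMP": None,
                 "HASH_DIFF": new_diff,
                 **descriptive})
    return rows
```

Comparing one stored hash against one computed hash is much cheaper than comparing every descriptive column individually, which is the practical motivation for the HASH-DIFF column.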

To conclude, this modelling technique seems promising and helps model a data warehouse incrementally. Combined with agile concepts, it definitely has the potential to become a powerful technique. I would be eager to hear from the practitioner community what their experiences with it have been so far.

