Data Vault Modelling - Part 2

This is a sequel to my earlier post where I introduced the concept of data vault modelling. As I learn more about it, I want to share it with a wider audience and leverage the collective intelligence of the community. I do not claim to be an expert on this topic yet, but I am trying to become one by practicing it. My interest has grown manifold because, theoretically, I see real value in this technique; I now need to validate that value by actually practicing it.

In this article, we will use the example of a fictitious utility company to go a little deeper into the concepts. The use case is as below:

A utility company (ABC Electric and Gas) has assets deployed across the state for power transmission to domestic and commercial establishments. An example of an asset is a transformer, or a camera that takes pictures of the devices and transmits them at regular intervals to verify that a device is in good condition. These assets need regular inspection, and the inspection details need to be logged for auditing purposes. So, clearly, we have two business objects here: ASSET and INSPECTION. We will continue our vault modelling with these two objects.

What attributes does the ASSET object capture?

  1. Asset_Id - A unique ID that identifies the asset
  2. Asset_Type - What is the type of the asset? Is it a camera, a transformer, or something else?
  3. Manufacturer - Who is the manufacturer of the device?
  4. Model - What is the model of the device?

There are many other attributes, but these are sufficient for understanding the concept; they are also generalized so as not to share any confidential information.

What attributes does the INSPECTION object capture?

  1. Inspection_Start_Date - When did the asset inspection start?
  2. Inspection_End_Date - When did the asset inspection end?
  3. Inspection_Performed_By - Who performed the inspection?

Again, there are many more attributes, but these are enough for a conceptual understanding.

Let us recall the core components of a Data Vault:

  1. HUB
  2. SATELLITE
  3. LINK

HUBs are the first components you model in the data vault technique. Think of a HUB as the component that helps you integrate the many data sources (systems of record) in the data warehouse. A key thing to understand here is that, for the HUB to integrate the source systems, its content must be source-system agnostic. HUBs should therefore carry keys that are meaningful and understood by all: the business keys, which are not tied to any source system and have a common understanding across the enterprise.

HUBs, therefore, are the store of all the business keys, the keys that the BUSINESS uses to search for and retrieve data. In our use case, we need two HUBs: one for ASSET and one for INSPECTION. Now comes the next complexity: under ASSET, we capture many types of assets (CAMERA, TRANSFORMER, etc.). Do we create a single HUB for assets, or one HUB per asset type? According to the book "Building a Scalable Data Warehouse with Data Vault 2.0" by Daniel Linstedt and Michael Olschimke, hubs are defined using a unique list of business keys which are supposed to have the same semantic meaning. Hence the recommendation is to have one HUB per asset type. So, in this case, I created a CAMERA_HUB as below.

CAMERA_HUB

  1. HUB_CAMERA_ID_KEY - A hash of the two business keys below
  2. ASSET_ID - BUSINESS KEY 1
  3. ASSET_TYPE - BUSINESS KEY 2
  4. LOAD_DATE_TIMESTAMP - This column tells us when the EDW system knew about this business key for the first time
  5. RECORD_SOURCE - This column tells us which source system populated this row. In case of multiple sources, this will have the first source which populated the row

The primary key of the HUB is usually calculated by taking a hash of the business key(s). The business key(s) must be declared unique, which means there will be only one row per business key (whether a single column or a compound key).
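As an illustrative sketch of this hash-key calculation: the exact hash function, delimiter, and normalization rules are implementation choices I am assuming here (MD5 with an upper-cased, `||`-delimited concatenation is a common convention in Data Vault 2.0 implementations, not something stated in this article), but the idea is that the same business key always produces the same surrogate key:

```python
import hashlib

def hub_hash_key(*business_keys: str) -> str:
    """Derive a hub hash key: normalize each business key part
    (trim whitespace, upper-case), join with a delimiter, and hash."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# HUB_CAMERA_ID_KEY is derived from the two business keys
# ASSET_ID and ASSET_TYPE (the values here are made up):
key = hub_hash_key("AST-1001", "Camera")
```

Because the input is normalized before hashing, the same asset arriving from different source systems with minor formatting differences still maps to the same hub row.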

Next, I created the INSPECTION hub as below

INSPECTION_HUB

  1. INSPECTION_LOG_ID_HASH - This is the hash of the business key
  2. INSPECTION_LOG_ID - This is the business key
  3. LOAD_DATE_TIMESTAMP - This column tells us when the EDW system knew about this business key for the first time
  4. RECORD_SOURCE - This column tells us which source system populated this row. In case of multiple sources, this will have the first source which populated the row

Next is the LINK table. I now need to link the two HUBs, CAMERA and INSPECTION. The LINK is the main component that makes this model flexible and extensible: it allows us to add new objects or relationships in the future without changing anything we have already developed. This is what gives the data warehouse its incremental development ability. In this particular case, I created the LINK table as below.

LINK-CAMERA

  1. LNK_CAMERA_LOG_HASH_KEY - This will be a hash of #2 and #3
  2. HUB_CAMERA_ID_KEY - Primary key from CAMERA HUB, this will be a foreign key here
  3. INSPECTION_LOG_ID_HASH - Primary key from INSPECTION HUB, this will be a foreign key here
  4. ASSET_ID - Business key of CAMERA HUB
  5. ASSET_TYPE - Business key of CAMERA HUB
  6. INSPECTION_LOG_ID - Business key of INSPECTION HUB
  7. LOAD_DATE_TIMESTAMP - This column tells us when the EDW system knew about this business key for the first time
  8. RECORD_SOURCE - This column tells us which source system populated this row. In case of multiple sources, this will have the first source which populated the row

So, the LINK table contains the hash key (a hash of the primary keys from the HUB tables), the primary keys from the HUB tables, the business keys of the HUBs, and the additional metadata columns. A LINK must connect at least two HUBs. A LINK table is similar to a FACT table in dimensional modelling, as its granularity is defined by the number of parent keys it has.
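To make the column list above concrete, here is a sketch of assembling one LINK-CAMERA row. Following the definition above, the link hash key is computed from the two hub hash keys (#2 and #3). The function and field values are hypothetical illustrations, not the author's actual load code:

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*parts: str) -> str:
    """Same normalization-and-hash convention used for the hub keys."""
    normalized = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def build_link_row(asset_id: str, asset_type: str,
                   inspection_log_id: str, record_source: str) -> dict:
    """Assemble one LINK-CAMERA row relating a camera to an inspection."""
    hub_camera_key = hash_key(asset_id, asset_type)   # from CAMERA_HUB
    insp_hash = hash_key(inspection_log_id)           # from INSPECTION_HUB
    return {
        "LNK_CAMERA_LOG_HASH_KEY": hash_key(hub_camera_key, insp_hash),
        "HUB_CAMERA_ID_KEY": hub_camera_key,
        "INSPECTION_LOG_ID_HASH": insp_hash,
        "ASSET_ID": asset_id,
        "ASSET_TYPE": asset_type,
        "INSPECTION_LOG_ID": inspection_log_id,
        "LOAD_DATE_TIMESTAMP": datetime.now(timezone.utc),
        "RECORD_SOURCE": record_source,
    }
```

Note that the link row itself carries no descriptive attributes; it only records that a relationship between the two hubs was observed, which is what keeps the model extensible.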

And now the last component: the SATELLITE. Satellites are connected to either a LINK table or a HUB table and contain the descriptive data of that HUB or LINK. This is where change and the history of the data are stored. The schema of a Satellite is like an SCD Type 2 (Slowly Changing Dimension): if any of the descriptive information changes, the row is not updated in place; instead, the current row is closed and a new row is inserted. The satellite tables are therefore modelled as below.

SATELLITE-CAMERA

  1. HUB_CAMERA_ID_KEY - Primary key of the HUB
  2. LOAD_DATE_TIMESTAMP - The day when this row was loaded
  3. LOAD_END_DATE_TIMESTAMP - The day the row was changed and closed
  4. MANUFACTURER - The descriptive column of the camera
  5. MODEL - The descriptive data of the camera

SATELLITE-INSPECTION

  1. INSPECTION_LOG_ID_HASH - Primary key of the HUB
  2. LOAD_DATE_TIMESTAMP - The day when this row was loaded
  3. LOAD_END_DATE_TIMESTAMP - The day the row was changed and closed
  4. INSPECTION_START_DATE - Descriptive column of the HUB
  5. INSPECTION_END_DATE - Descriptive column of the HUB
  6. INSPECTION_PERFORMED_BY - Descriptive column of the HUB

One more column that I have not included in the satellite tables above is the HASH-DIFF key. To be DV 2.0 compliant, you must calculate this column by taking a hash of all the descriptive columns of a row (not the metadata columns) and store it in the table. This column tells us whether the row has changed since it was previously loaded into the data vault.
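The hash-diff and the close-and-insert behaviour described above can be sketched together. This is a minimal in-memory illustration under my own assumptions (MD5 hashing, sorted column names, rows as dicts); a real load would do the same comparison in SQL or an ETL tool:

```python
import hashlib
from datetime import datetime, timezone

def hash_diff(descriptive: dict) -> str:
    """Hash all descriptive columns, sorted by column name so the
    result is deterministic. Metadata columns are excluded."""
    payload = "||".join(f"{k}={str(v).strip().upper()}"
                        for k, v in sorted(descriptive.items()))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def load_satellite_row(rows: list, hub_key: str, descriptive: dict) -> list:
    """SCD Type 2 behaviour: if the hash-diff is unchanged, do nothing;
    otherwise close the current open row and insert a new version."""
    new_diff = hash_diff(descriptive)
    open_rows = [r for r in rows
                 if r["HUB_CAMERA_ID_KEY"] == hub_key
                 and r["LOAD_END_DATE_TIMESTAMP"] is None]
    if open_rows and open_rows[0]["HASH_DIFF"] == new_diff:
        return rows                          # no change detected
    now = datetime.now(timezone.utc)
    for r in open_rows:                      # close the superseded version
        r["LOAD_END_DATE_TIMESTAMP"] = now
    rows.append({"HUB_CAMERA_ID_KEY": hub_key,
                 "LOAD_DATE_TIMESTAMP": now,
                 "LOAD_END_DATE_TIMESTAMP": None,
                 "HASH_DIFF": new_diff,
                 **descriptive})
    return rows
```

Comparing one stored hash against one computed hash is much cheaper than comparing every descriptive column individually, which is the practical motivation for the HASH-DIFF column.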

To conclude, this modelling technique seems promising and helps model a data warehouse incrementally. Combined with agile concepts, it definitely has the potential to become a powerful technique. I would be eager to hear from the practitioner community what their experiences with it have been so far.

