Data Vault Constructs: Hubs (Modern Data Warehousing, Part 4)

Christian Kaul

Data Modeling Aficionado and Senior Technical Consultant at virtual7 GmbH

Published: February 15, 2021

This article is the fourth part in an ongoing series on modern data warehousing using data vault. The first part covers data warehouse layers, the second part kinds of time, and the third part the number of timelines.

Data Vault Constructs

Unified Decomposition

Data vault modeling is based on unified decomposition. This means that data is split into different types of database objects that contain

unique identifiers for concepts (hubs),
connections between concepts (links) and
details about concepts (satellites)

but are held together by the common unique identifiers.

This is sometimes called passive integration (the data is integrated not by actively shoehorning it into a single database object but passively via the common unique identifier stored in the different database objects).
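As a rough illustration of passive integration, the following sketch keeps hub, link and satellite data in separate Python structures and brings them together only through the shared identifier. All names and sample values (customer_hub, purchase_link, C-1001 and so on) are hypothetical and not taken from any particular data vault implementation.

```python
# Hypothetical sample data: each structure stands for a separate database object.
customer_hub = ["C-1001", "C-1002"]          # hub: unique identifiers only
product_hub = ["P-77"]                       # hub for a second concept
purchase_link = [("C-1001", "P-77")]         # link: connection between concepts
customer_satellite = [                       # satellite: details about a concept
    {"customer": "C-1001", "name": "Ada", "city": "Berlin"},
    {"customer": "C-1002", "name": "Grace", "city": "Hamburg"},
]


def describe_customer(customer_id):
    """Collect everything known about one customer via the shared identifier."""
    details = [row for row in customer_satellite if row["customer"] == customer_id]
    products = [product for (customer, product) in purchase_link
                if customer == customer_id]
    return {"identifier": customer_id, "details": details, "products": products}


print(describe_customer("C-1001"))
```

Nothing is merged into a single object up front; the pieces only meet at query time because they carry the same identifier.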

Mandatory and Optional Constructs

While you can (and definitely should try to) store all the data in the core layer of your data warehouse using these three mandatory constructs, there are some optional constructs that might occasionally be helpful as well.

Just keep in mind that one of the main strengths of data vault is the focus on a small number of repeatable patterns. Don’t jeopardize this by using too many different types of constructs just because one or the other seems to be the easiest way out in a certain situation.

In any case, you should pay special attention to your hubs and links. Together, they form the backbone of your data vault model.

Mandatory Construct #1: Hub

A hub is a database object (usually a table) that contains a list of unique human-readable identifiers for instances of a certain concept.

Concepts

These concepts can include everything that is of interest for an organization (like Customer, Employee, Product, Sale, …). While many of them will be core business concepts that are essential to the functioning of the organization, others might be driven by source systems using different concepts than the people working for the organization or by the need to identify instances of connections between concepts.

The instances of a concept stored in a hub are timeless and never change. Any details about them that may change over time are stored in satellite tables attached to the hub.

Human-readable Identifiers

The essential component of a hub is a human-readable string that uniquely identifies an instance of a concept (e.g., one particular Customer) across the whole organization. While we always hope to find a single attribute in the source systems that contains this string, it often doesn’t exist.

To remedy this problem, we have two main tools available:

composite identifiers and
namespaces.

Composite Identifiers

A composite identifier is a concatenation of attributes that uniquely identifies an instance of a concept. An Order Line Item, for example, might be uniquely identified by the string 4711*&#5 that consists of the order ID 4711, the separator *&# and the sequential number 5.

A separator is necessary to distinguish, e.g., the concatenation of 4711 and 5 from the concatenation of 471 and 15, both of which would otherwise yield 47115. To make sure that other parts of an identifier aren’t mistaken for the separator, it is recommended to use an unusual combination of characters such as *&#.
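A minimal sketch of how such a composite identifier could be built is shown below. The function name composite_identifier is an assumption made for illustration, and the check that rejects parts containing the separator is an extra safeguard that the article does not prescribe.

```python
SEPARATOR = "*&#"  # the unusual character combination used in the example above


def composite_identifier(*parts):
    """Concatenate the parts of an identifier, separated by SEPARATOR."""
    texts = [str(part) for part in parts]
    for text in texts:
        # Assumption: reject parts that could be mistaken for the separator.
        if SEPARATOR in text:
            raise ValueError(f"part {text!r} contains the separator")
    return SEPARATOR.join(texts)


print(composite_identifier(4711, 5))   # 4711*&#5
print(composite_identifier(471, 15))   # 471*&#15
print("4711" + "5" == "471" + "15")    # True: without a separator both are 47115
```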

Namespaces

Namespaces are necessary if the same identifier can identify different instances of the same concept in different contexts like different countries, different organizational units or different source systems. To arrive at an identifier that is unique across the whole organization, you need to add a namespace prefix.

If your customer numbers are only unique by country, for example, you have to add a prefix like an ISO 3166 country code, followed by the standard separator. The final identifier stored in your Customer hub will look something like DE*&#1742.
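The country-code example might be sketched as follows. The source rows and the column names country and customer_no are hypothetical; the point is only that the same customer number from two countries ends up as two distinct hub identifiers.

```python
SEPARATOR = "*&#"  # the standard separator from the composite identifier section


def namespaced_identifier(namespace, local_identifier):
    """Prefix a locally unique identifier with its namespace."""
    return f"{namespace}{SEPARATOR}{local_identifier}"


# Hypothetical source rows: customer number 1742 exists in two countries.
source_rows = [
    {"country": "DE", "customer_no": "1742"},
    {"country": "FR", "customer_no": "1742"},
]

hub_identifiers = sorted(
    namespaced_identifier(row["country"], row["customer_no"]) for row in source_rows
)
print(hub_identifiers)  # ['DE*&#1742', 'FR*&#1742']
```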

Be careful not to abuse namespaces by just prefixing every identifier with the name of the source system where it came from by default. With this approach, you lose the passive-integration benefit of the hubs and create potentially many identifiers for the same instance that you will have to bring back together later, often with considerable effort.

Technical Columns

Apart from the unique human-readable identifier, a hub usually contains a few additional technical columns:

a surrogate identifier of predictable length and structure (often a hash key derived from the human-readable identifier),
the load time when the human-readable identifier was first loaded to the data warehouse,
some kind of record source indicator (providing the name of the source system and the source data object in plain text or an identifier for looking up the record source) and
some kind of load process or audit record identifier (for looking up details about the load process that put the record into the hub).
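To make the column list more concrete, here is a small sketch of a hub load that fills these technical columns. It rests on several assumptions: SHA-256 is chosen as the hash function purely for illustration (the article does not name one), the hub is modeled as an in-memory dictionary, and the column and parameter names (customer_id, record_source, load_process_id) are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(identifier):
    """Derive a fixed-length surrogate identifier from the human-readable one.

    Assumption: SHA-256 is used here only as an example hash function.
    """
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def load_hub(hub, identifiers, record_source, load_process_id, load_time=None):
    """Add identifiers that are not in the hub yet; existing rows stay untouched."""
    if load_time is None:
        load_time = datetime.now(timezone.utc)
    for identifier in identifiers:
        key = hash_key(identifier)
        if key not in hub:  # keep the row from the first load as it is
            hub[key] = {
                "customer_id": identifier,           # human-readable identifier
                "load_time": load_time,              # when it was first loaded
                "record_source": record_source,      # where it came from
                "load_process_id": load_process_id,  # which load wrote the row
            }
    return hub


customer_hub = {}
load_hub(customer_hub, ["DE*&#1742"], "crm.customer", load_process_id=1)
load_hub(customer_hub, ["DE*&#1742", "FR*&#1742"], "erp.customer", load_process_id=2)
print(len(customer_hub))  # 2
```

Because the second load skips identifiers that already exist, the row for DE*&#1742 keeps the load time and record source of its first appearance, in line with the idea that hub entries never change.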

Future articles will describe the other data vault constructs (both mandatory and optional ones), different kinds of surrogate identifiers (with their respective advantages and drawbacks) and all kinds of technical columns. Stay tuned!

Joakim Dalby

Consultant database, BI, data warehouse, data mart, cube, ETL, SQL, analysis, design, development, documentation, test, management, SQL Server, Access, ADP+, Kimball practitioner. JOIN people ON data.

4 years ago

I like to distinguish between the two: a compound identifier is concatenated data, e.g. a US phone number consisting of area code + exchange + local number in an EmployeePhonebook table. A composite identifier is composed of multiple columns whose combination uniquely identifies each row in a table, e.g. EmployeeId and CourseNumber in a Participant table.
