Data Vault Constructs: Hubs (Modern Data Warehousing, Part 4)
Christian Kaul
Data Modeling Aficionado and Senior Technical Consultant at virtual7 GmbH
This article is the fourth part in an ongoing series on modern data warehousing using data vault. ?The first part on data warehouse layers can be found here, the second part on kinds of time can be found here and the third part on number of timelines can be found here.
Data Vault Constructs
Unified Decomposition
Data vault modeling is based on unified decomposition. This means that data is split into different types of database objects that contain
- unique identifiers for concepts (hubs),
- connections between concepts (links) and
- details about concepts (satellites)
but held together by the common unique identifiers.
This is sometimes called passive integration (the data is integrated not by actively shoehorning it into a single database object but passively via the common unique identifier stored in the different database objects).
Mandatory and Optional Constructs
While you can (and definitely should try to) store all the data in the core layer of your data warehouse using these three mandatory constructs, there are some optional constructs that might occasionally be helpful as well.
Just keep in mind that one of the main strengths of data vault is the focus on a small number of repeatable patterns. Don’t jeopardize this by using too many different types of constructs just because one or the other seems to be the easiest way out in a certain situation.
In any case, you should pay special attention to your hubs and links. Together, they form the backbone of your data vault model.
Mandatory Construct #1: Hub
A hub is a database object (usually a table) that contains a list of unique human-readable identifiers for instances of a certain concept.
Concepts
These concepts can include everything that is of interest for an organization (like Customer, Employee, Product, Sale, …). While many of them will be core business concepts that are essential to the functioning of the organization, others might be driven by source systems using different concepts than the people working for the organization or by the need to identify instances of connections between concepts.
The instances of a concept stored in a hub are timeless and never change. Any details about them that may change over time are stored in satellite tables attached to the hub.
Human-readable Identifiers
The essential component of a hub is a human-readable string that uniquely identifies an instance of a concept (e.g., one particular Customer) across the whole organization. While we always hope to find a single attribute in the source systems that contains this string, it often doesn’t exist.
To remedy this problem, we have two main tools available:
- composite identifiers and
- namespaces.
Composite Identifiers
A composite identifier is a concatenation of attributes that uniquely identifies an instance of a concept. An Order Line Item, for example, might be uniquely identified by the string 4711* that consists of the order ID 4711, the separator *&# and the sequential number 5.
A separator is necessary to distinguish the concatenation of, e. g., 4711 and 5 from the concatenation of 471 and 15. To make sure that other parts of an identifier aren’t mistaken for the separator, it is recommended to use an unusual combination of characters like, e. g., *&#.
Namespaces
Namespaces are necessary if the same identifier can identify different instances of the same concept in different contexts like different countries, different organizational units or different source systems. To arrive at an identifier that is unique across the whole organization, you need to add a namespace prefix.
If your customer numbers are only unique by country, for example, you have to add a prefix like an ISO 3166 country code, followed by the standard separator. The final identifier stored in your Customer hub will look something like DE*ێ.
Be careful not to abuse namespaces by just prefixing every identifier with the name of the source system where it came from by default. With this approach, you lose the passive-integration benefit of the hubs and create potentially many identifiers for the same instance that you will have to bring back together later, often with considerable effort.
Technical Columns
Apart from the unique human-readable identifier, a hub usually contains a few additional technical columns:
- a surrogate identifier of predictable length and structure (often a hash key derived from the human-readable identifier),
- the load time when the human-readable identifier was first loaded to the data warehouse,
- some kind of record source indicator (providing the name of the source system and the source data object in plain text or an identifier for looking up the record source) and
- some kind of load process or audit record identifier (for looking up details about the load process that put the record into the hub).
Future articles will describe the other data vault constructs (both mandatory and optional ones), different kinds of surrogate identifiers (with their respective advantages and drawbacks) and all kinds of technical columns. Stay tuned!
Consultant database, BI, data warehouse, data mart, cube, ETL, SQL, analysis, design, development, documentation, test, management, SQL Server, Access, ADP+, Kimball practitioner. JOIN people ON data.
4 年I like to distinguish between: A compound identifier is concatenated data e.g. a US phone number contains area code + exchange + local number in an EmployeePhonebook table. A composite identifier is composed of multiple columns which combination is used to uniquely identify each row in a table, e.g. EmployeeId and CourseNumber in a Participant table.