Modern Data Platform Architecture using Data Vault
Saikrishna Cheruvu
Cloud Architect · Data Engineer · Feature Engineering · Unsupervised Learning · DataOps
The Data Vault approach, created by Dan Linstedt in the 1990s, is designed to make the benefits of enterprise data warehousing accessible to everyone. It was followed by Data Vault 2.0 in 2013, which provided a set of enhancements focused on NoSQL and Big Data and introduced integrations for unstructured and semi-structured data.
Linstedt's aim is to enable data architects and engineers to build data warehouses faster, i.e. with shorter deployment timeframes and in a way that more efficiently addresses business needs.
Data Vault not only helps us model our data efficiently, but it also provides us with a scalable and flexible multi-tier architecture.
Before jumping directly into Data Vault, how does this model work within an enterprise data warehouse? Below is the conceptual view I created to represent the modern data platform, followed by an explanation of each tier.
Bronze layer: "data landing or staging area"
The bronze layer's name is derived from the lakehouse design; in traditional database terminology it is called the staging layer or ODS (operational data store). This layer is a landing zone for data arriving from the source systems. Generally, data scientists will extract data from the bronze layer, and data mining and other machine learning use cases can also be implemented against it.
Data from sourcing and ingestion is stored in the bronze layer. The staging process should be automated, using a drop-and-create technique on every run so that the staging layer can be recreated at any time.
The goal of automating the process: we shouldn't waste time loading data into the Data Vault. The majority of our time should be spent collaborating with the business and implementing their requirements in information marts.
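As a minimal sketch of that drop-and-create staging pattern (using SQLite and a hypothetical stg_orders table purely for illustration; a real platform would use its own SQL dialect and connection):

```python
import sqlite3

def rebuild_staging(conn: sqlite3.Connection) -> None:
    """Drop and recreate the staging table so the layer is reproducible."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS stg_orders")
    cur.execute("""
        CREATE TABLE stg_orders (
            order_id      TEXT,
            customer_id   TEXT,
            order_total   REAL,
            load_date     TEXT,   -- when the row landed in staging
            record_source TEXT    -- which source system it came from
        )
    """)
    conn.commit()

# Recreate the staging layer on every run of the pipeline.
conn = sqlite3.connect(":memory:")
rebuild_staging(conn)
```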
Silver layer: "cleansed and conformed data"
The data from the bronze layer is matched, merged, conformed, and cleansed in the lakehouse's silver layer, so that the silver layer can provide an "enterprise view" of all its key business entities, concepts, and transactions. Examples: master customers, stores, de-duplicated transactions, and cross-reference tables.
The silver layer consolidates data from various sources into an enterprise view and enables self-service analytics for ad hoc reporting, advanced analytics, and machine learning. It serves as the source from which departmental analysts, data engineers, and data scientists build gold-layer projects and analyses that answer business problems using enterprise and departmental data.
The lakehouse design follows the ELT method, so transformations are applied after the data is loaded and cleaned; speed of processing and delivery is the priority. Most of the business rules and derived attributes are executed here, and the Data Vault is implemented in this layer.
Vault implementation:
The vault design is based on the creation of three vaults.
Raw vault:
The raw vault is divided into three types of tables or objects: hubs, links, and satellites.
Data is ingested into the raw vault directly from the staging layer, or potentially straight from the source when handling real-time data feeds. No business rules are applied in the raw vault.
This must be an automated process; hand-written SQL is discouraged because the hub, link, and satellite loading steps are all interconnected. If the process is not automated, the keys will not match. This is a crucial step in Data Vault creation.
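To illustrate why a single automated routine matters, here is a hedged sketch of a shared hash-key function; MD5 and the "||" delimiter are common Data Vault 2.0 conventions, but the exact choices vary by project:

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Build a deterministic hash key from one or more business keys.

    Trimming, upper-casing, and a fixed delimiter guarantee that hub,
    link, and satellite loads all compute identical keys.
    """
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same routine is used everywhere, so keys always match:
customer_hk = hash_key("CUST-001")               # hub hash key
order_hk    = hash_key("ORD-1001")               # hub hash key
link_hk     = hash_key("CUST-001", "ORD-1001")   # link key from both hubs
```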
Hubs:
A hub stores an entity's core business keys. The entities behind the facts and dimensions of an existing star schema implementation are good candidates for conversion into hub tables.
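As an illustrative sketch (the entity and column names are assumptions, not a fixed standard), a hub row carries only the business key plus Data Vault metadata:

```python
import datetime
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class HubCustomer:
    """A hub row: just the business key plus Data Vault metadata."""
    customer_hk: str              # hash of the business key
    customer_bk: str              # the business key itself
    load_date: datetime.datetime  # when the key was first seen
    record_source: str            # which system supplied it

row = HubCustomer(
    customer_hk=hashlib.md5(b"CUST-001").hexdigest(),
    customer_bk="CUST-001",
    load_date=datetime.datetime.now(datetime.timezone.utc),
    record_source="CRM",
)
```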
Links:
A link defines the relationship between two or more hubs' business keys.
A link structure, like the hub, contains no contextual information about the entities. In addition, only one row should represent the relationship between two entities. To represent a defunct relationship, we would need to create a satellite table off of the link table that contains a deleted flag; this is known as an effectivity satellite.
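As a sketch of these two structures (names assumed for illustration), a link row records only the hashed relationship, while an effectivity satellite hung off the link tracks whether that relationship is still active:

```python
import datetime
from dataclasses import dataclass

@dataclass(frozen=True)
class LinkCustomerOrder:
    """One row per unique customer/order relationship; no context here."""
    link_hk: str       # hash of both hubs' business keys
    customer_hk: str   # hash key of the customer hub
    order_hk: str      # hash key of the order hub
    load_date: datetime.datetime
    record_source: str

@dataclass(frozen=True)
class SatLinkEffectivity:
    """Effectivity satellite on the link: flags defunct relationships."""
    link_hk: str
    load_date: datetime.datetime
    is_deleted: bool   # True once the relationship no longer exists
    record_source: str
```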
One significant advantage Data Vault has over other data warehousing architectures is the ease with which relationships can be added between Hubs. Data Vault prioritizes agility and implementing what is required to meet current business objectives. If relationships aren't currently known or data sources aren't yet available, that's fine because links can be easily created as needed. Adding a new link has no effect on existing hubs or satellites.
Satellites:
A satellite in the Data Vault architecture stores all of an entity's contextual information.
Data in my industry is constantly changing. How will non-volatile contextual tables help me?
When the data changes, a new row with the updated information must be added. The hash key and one of the Data Vault-mandated fields, the load date, are used to distinguish these records. The load date enables us to determine the most recent version of a given record.
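For illustration, here is a reconstructed sample of two such satellite rows (the hash and dates are made up for this example):

```
emp_hash       load_date    emp_name
a1b2c3d4...    2023-01-01   Jhon Doe
a1b2c3d4...    2023-01-07   John Doe
```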
In the preceding example, we see two records with the same emp hash. The most recent record, as defined by the load date, corrects a typo in the emp name field.
But won't it take an eternity to figure out what has changed between the source and the Data Vault?
No, using a content hash makes this extremely fast. While the content hash is optional in a Data Vault model, it provides a significant advantage when examining records that have changed between source and target systems.
When populating the Data Vault staging area, the content hash is computed using all relevant contextual data fields. When any of these contextual fields is updated, a new content hash is generated, which enables us to detect changes quickly. This is most commonly accomplished with an outer join, though some systems offer even more optimized techniques depending on the technology used.
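A hedged sketch of that change-detection step, written in Python for readability (in practice this is usually a set-based SQL outer join between staging and the satellite; the field names are assumptions):

```python
import hashlib

CONTEXT_FIELDS = ["emp_name", "email", "department"]  # assumed fields

def content_hash(record: dict) -> str:
    """Hash all contextual fields so any change yields a new hash."""
    payload = "||".join(str(record.get(f, "")).strip() for f in CONTEXT_FIELDS)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def changed_rows(staging_rows, latest_sat_hashes):
    """Yield staging rows that are new or changed.

    latest_sat_hashes maps each hub hash key to the content hash of its
    most recent satellite row -- the role the outer join plays in SQL.
    """
    for row in staging_rows:
        if latest_sat_hashes.get(row["emp_hk"]) != content_hash(row):
            yield row   # insert as a new satellite row
```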
Satellites are created to aid in differentiation based on data source and rate of change. In general, you would create a new satellite table for each data source and then further separate data from those sources that change frequently. Separating high and low-frequency data attributes can help with ingestion throughput and reduce the amount of space that historical data takes up. Separating the attributes by frequency is optional, but it can provide some benefits.
Data classification is another common consideration when developing satellites. Data can be separated using satellites based on classification or sensitivity. Physically separating data elements makes it easier to handle special security considerations.
Gold layer: information delivery / downstream
The recipe you prepared in the silver layer above is now served in the gold layer in an organized way: here, the data is shaped into data marts. There are three types of data marts, which are segregated as follows.
Information marts or data marts:
Information marts are where business users finally gain access to the data. All remaining business rules and logic are applied in these marts. Each mart might be a database or an individual table; the mart is the end product of the application.
Either business users or ETL users will utilize it to create their models or reports, and the analytics layer is built on top of this layer.
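As a simplified sketch (table and column names are assumed), many information marts boil down to joining a hub to the most recent satellite row per key, with business rules applied on top:

```python
def latest_per_key(sat_rows):
    """Keep only the most recent satellite row per hash key,
    using load_date to pick the current version of each record."""
    latest = {}
    for row in sat_rows:
        key = row["customer_hk"]
        if key not in latest or row["load_date"] > latest[key]["load_date"]:
            latest[key] = row
    return latest

def build_dim_customer(hub_rows, sat_rows):
    """Assemble a mart-style customer dimension from hub keys
    plus the current contextual attributes from the satellite."""
    current = latest_per_key(sat_rows)
    return [{**hub, **current.get(hub["customer_hk"], {})} for hub in hub_rows]
```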
Error Marts
Error Marts are an optional layer in the Data Vault that can be useful for surfacing data issues to business users. Remember that all data, correct or not, should remain as historical data in the Data Vault for audit and traceability.
Metrics Marts
The Metrics Mart is an optional tier used to surface operational metrics for analytical or reporting purposes.
Conclusion
I hope you can now see the traditional modeling techniques at play (third normal form supporting transactions, and dimensional modeling supporting analytics). We've also seen that the Data Vault methodology is all about responding to changing business requirements faster while eliminating the refactoring cost previously associated with dimensional modeling. The main takeaway is that each modeling technique is tailored to a specific purpose, so choosing one technique over another, and reaping its benefits, is largely determined by the business requirements.
If you find any issues or mistakes in this write-up, please comment and I will edit the page.
Thank you!