Using SAS DI Studio To Load A Data Vault
Posted 10-22-2014 04:25 AM | by LinusH (8509 views)



Introduction

Data Vault has over the last few years become a popular modelling technique for data warehouses. My experience is that it has not (yet) been used widely within SAS implementations. So my questions are: Could you use SAS Data Integration Studio for this? Is it recommended?

This article is targeted at audiences with little or no knowledge of Data Vault or SAS Data Integration Studio, so bear with me if you feel that some parts seem abundantly clear.

 

What is SAS Data Integration Studio?

SAS Data Integration Studio is an ETL tool offered by SAS Institute, and is a part of their Data Management portfolio. It lets you build and maintain metadata for databases, entities and jobs. It contains a set of standard transformations that help you copy, map, transform and load your data. The metadata is the basis for deploying job code, which is written in the SAS language. The SAS language is a combination of the following (a small illustrative snippet follows the list):

ANSI SQL with SAS extensions
Vendor RDBMS specific SQL
Data step language
Extensive function library
Procedures for standard data and statistical tasks
Macro language to create parametrized and reusable code
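
To give a flavour of this mix, here is a small, hypothetical snippet (all table and column names are made up for illustration) that combines the macro language, PROC SQL and the data step:

/* Hypothetical example: a reusable macro wrapping a PROC SQL aggregation */
%macro summarize(indsn=, outdsn=, byvar=);
    proc sql;
        create table &outdsn. as
        select &byvar., count(*) as n_rows
        from &indsn.
        group by &byvar.;
    quit;
%mend summarize;

/* Data step: derive a flag with a SAS function */
data work.customers_flagged;
    set work.customers;
    has_email = (not missing(email));
run;

%summarize(indsn=work.customers_flagged, outdsn=work.cust_summary, byvar=has_email);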

 

What is Data Vault?

Data Vault is a data modelling paradigm, first presented by Dan Linstedt. It is aimed at the atomic layer of a data warehouse, and it was developed to address some of the problems that arise when modelling a data warehouse using the third normal form (3NF).

The simplest way to explain what Data Vault is, is to describe its three core entity types (a minimal table sketch follows the list):

Hub: identifies a business concept. Contains only the business key for that concept (along with standard technical attributes such as Load Time, Source System Code and, of course, a surrogate key (SK)). No versioning – no valid from/to dates.
Link: establishes relationships between hubs. Always modelled as many-to-many (M-M), regardless of the nature of the current real-world relationship. No versioning.
Satellite: contains descriptive data: codes, names, numerical values, business dates etc. This data is always versioned. A satellite can only have a relationship to one parent table: a hub or, less commonly, a link.
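
To make these three entity types concrete, here is a minimal, hand-written sketch of what they could look like as tables, using the Customer and Account example that appears later in this article. The dv library reference and all column names are illustrative assumptions, not part of any SAS-supplied model:

proc sql;
    /* Hub: business key plus surrogate key and technical columns, no versioning */
    create table dv.h_customer (
        customer_sk       num,
        customer_bk       char(12),                  /* business key, e.g. an SSN */
        load_dttm         num format=datetime20.,
        source_system_cd  char(8)
    );
    /* The Account hub (dv.h_account) would follow exactly the same pattern. */

    /* Link: relates hubs through their surrogate keys, always modelled M-M */
    create table dv.l_customer_account (
        customer_account_sk  num,
        customer_sk          num,
        account_sk           num,
        load_dttm            num format=datetime20.,
        source_system_cd     char(8)
    );

    /* Satellite: descriptive, versioned attributes belonging to exactly one hub */
    create table dv.s_customer (
        customer_sk       num,
        valid_from_dttm   num format=datetime20.,
        valid_to_dttm     num format=datetime20.,
        customer_name     char(100),
        customer_segment  char(20)
    );
quit;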

A commonly expressed best practice is to have two persistent layers in a Data Vault warehouse:

Core Data Vault: reads all data, stores all data. No transformation other than maintaining business keys and mapping to common column names and data types. Considered the “facts” of the data warehouse.
Business Data Vault: also an atomic layer, but is the result after applying common business and transformation rules. These could include calculations, common code structures, data quality initiatives, customer segmentation and other statistical models.

 

Why Use Data Vault?

There are several benefits of using the Data Vault (DV) paradigm over 3NF, including (but not limited to):

Flexible: when adding new data sources to your DW, there is usually no need to redesign existing tables – just add new ones.
Agile: the flexibility lets you build small portions in each sprint. Extensions to the model usually means separate Hubs and Links.
Predictable loading pattern: a 3NF DW usually needs many waves to load so that it can maintain referential integrity. As an example, the SAS model for the Insurance industry needs 23 waves. A Data Vault only needs 3! How can that be, given that there is no exact Data Vault version of the SAS Insurance industry model? Because in a Data Vault you can always load the data in these three waves:
Hubs – create hub surrogate keys based on business keys
Links (need hub surrogate keys as FK, and potentially to build a link surrogate key), and Hub Satellites (need the Hub PK which will form the Satellite PK with a load timestamp or a valid from date/date time)
Link Satellites: need the link surrogate key, which will form the PK together with a date/date time stamp.
If the cardinality of a relationship changes, do nothing: links are designed to handle 1-M, M-1 and M-M.
No need to tweak referential integrity by only promoting part of the PK as FK (in versioned 3NF models, the valid from date/datetime is usually the part of the PK that is not promoted).

 

Data Vault and ETL

What I will cover in this document is the load part of a DW. The extract part will probably be quite similar to that of an ordinary DW. As for transformation, there is not much of it in a Core Data Vault, but a bit more in a Business Data Vault.

As you might have guessed from what is described above, these are the main activities when loading a Data Vault (a sketch of the look-up step follows the list):

Create surrogate keys
For hubs, based on a business key
For links, based on the surrogate keys (SK) from the participating hubs
Look-up surrogate keys
For links, the hub SKs
For hub satellites, the hub SK
For link satellites, the hub SK, then the link SK
Versioning: create generations of records when attributes change. This is only done for satellites.
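
As a flavour of what the look-up step boils down to in code, here is a hedged, hand-written sketch that resolves a hub SK from a business key with a hash object and routes unmatched rows to an exception table. It is not the code DI Studio generates; the table and column names are assumptions carried over from the sketch above:

data work.stg_matched work.stg_no_match;
    set work.stg_source;
    if _n_ = 1 then do;
        /* bring the hub columns into the PDV, then load the hub into a hash */
        if 0 then set dv.h_customer(keep=customer_bk customer_sk);
        declare hash h(dataset: 'dv.h_customer');
        h.defineKey('customer_bk');
        h.defineData('customer_sk');
        h.defineDone();
    end;
    call missing(customer_sk);                      /* clear before each look-up */
    if h.find() = 0 then output work.stg_matched;   /* hub SK resolved           */
    else output work.stg_no_match;                  /* no match: exception row   */
run;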

 

Sample Data Vault Model

This is a sample model that we will use for our example DI Studio load jobs. The different entity types are shown in the de facto colouring standard. This is just a small part of the whole model, and only a subset of attributes is visible, so we can focus on principles rather than details.

Customer and Account are core business concepts, and are represented as hubs. They are mandatory.

The link represents a relationship between an account and a customer. A customer can have multiple accounts, and an account can (model-wise) have multiple customers connected to it.

The satellites are used to store attributes that describe the core business concepts (hubs) or the relationships (links).


 

How Can We Use SAS DI Studio To Load A Data Vault?

Loading a Hub




We map 1-1 from a staging table that contains the business key for the business concept Customer. For private customers this is typically some kind of SSN, if available. The technical column Source System Code is being mapped here, but it could be done in the stage table if that seems more appropriate. The surrogate key and Load date time columns are generated (out of the box) by the SCD Type 1 transformation.

As an alternative, you could use the Surrogate Key Generator transformation. But it generates an output of all records (including existing ones), which you need to take care of separately in a succeeding Table Loader step. The Surrogate Key Generator is a so-called “Generated Transform”, which means that you, as an ETL developer, can customize it using the SAS language.
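
For readers who prefer to see the logic as code, here is a hedged, hand-written equivalent of what the hub load does: insert only business keys that are not yet in the hub and assign them a sequential surrogate key. This is not the code the SCD Type 1 transformation generates; the dv and work table names, the source system code and the key strategy are illustrative assumptions:

/* Find the current maximum surrogate key in the hub (0 if the hub is empty) */
proc sql noprint;
    select coalesce(max(customer_sk), 0) into :max_sk trimmed
    from dv.h_customer;
quit;

data work.new_hub_rows;
    set work.stg_customer;
    length source_system_cd $8;
    format load_dttm datetime20.;
    if _n_ = 1 then do;
        declare hash h(dataset: 'dv.h_customer');   /* existing business keys */
        h.defineKey('customer_bk');
        h.defineDone();
    end;
    /* only business keys not yet in the hub become new hub rows */
    if h.find() ne 0 then do;
        _seq + 1;
        customer_sk      = &max_sk. + _seq;
        load_dttm        = datetime();
        source_system_cd = 'CRM';                   /* illustrative source code  */
        rc = h.add();                               /* avoid in-batch duplicates */
        output;
    end;
    drop _seq rc;
run;

proc append base=dv.h_customer data=work.new_hub_rows force;
run;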

 

Loading a Satellite




Here’s an example of loading a satellite table for Customer. The first step is to get the hub SK by using the business key in a Lookup. The SCD Type 2 Loader transformation handles versioning of data. You should not generate a new SK in the satellite, since no other table should ever reference a satellite directly. Use the Customer hub SK as the business key in the SCD Type 2 transformation.
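
A hedged, hand-coded sketch of the same logic is shown below: a look-up of the hub SK by business key, followed by a simple SCD Type 2 pattern that closes changed versions and opens new ones. It is not the generated DI Studio code, and the stage table, dv library and attribute names are illustrative assumptions:

/* 1. Look up the Customer hub SK for each staged attribute row */
proc sql;
    create table work.stg_cust_sat as
    select h.customer_sk,
           s.customer_name,
           s.customer_segment,
           datetime() as valid_from_dttm format=datetime20.
    from stage.customer_attributes s
         inner join dv.h_customer h
            on s.customer_bk = h.customer_bk;
quit;

/* 2. Keep only rows that are new or differ from the current satellite version */
proc sql;
    create table work.sat_delta as
    select stg.*
    from work.stg_cust_sat stg
         left join dv.s_customer(where=(missing(valid_to_dttm))) cur
            on stg.customer_sk = cur.customer_sk
    where missing(cur.customer_sk)                        /* brand-new customer   */
       or stg.customer_name    ne cur.customer_name       /* or changed attribute */
       or stg.customer_segment ne cur.customer_segment;
quit;

proc sql;
    /* 3. Close the current version of the changed customers */
    update dv.s_customer
    set valid_to_dttm = datetime()
    where missing(valid_to_dttm)
      and customer_sk in (select customer_sk from work.sat_delta);

    /* 4. Open a new version (valid_to_dttm left missing marks it as current) */
    insert into dv.s_customer
           (customer_sk, valid_from_dttm, customer_name, customer_segment)
    select customer_sk, valid_from_dttm, customer_name, customer_segment
    from work.sat_delta;
quit;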

 

Loading a Link




Loading a link typically needs two steps. First, look up the SKs for the contributing hubs by using their business keys.

Then use the SCD Type 1 Loader to manage the creation of new link records, assigning them a link SK if needed. Why does the link need an SK? If you need to add attributes to a link, they will be stored in a satellite connected to the link, and the link SK is then needed as an FK in that satellite.

The separate Table Loader captures those records where there is no match in the key Lookup. You can choose in the Lookup how to handle these exceptions (move the record to an exceptions table, abort the job, etc.).
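
Below is a hedged, hand-written sketch of that flow: resolve both hub SKs, route rows with a failed look-up to an exceptions table, and insert only new SK combinations into the link. It is not the DI Studio generated code; the stage table names, the dv library and the exception handling are illustrative assumptions:

proc sql noprint;
    /* current maximum link surrogate key (0 if the link table is empty) */
    select coalesce(max(customer_account_sk), 0) into :max_lsk trimmed
    from dv.l_customer_account;

    /* resolve both hub SKs via their business keys */
    create table work.stg_link as
    select hc.customer_sk,
           ha.account_sk,
           datetime() as load_dttm format=datetime20.,
           stg.source_system_cd
    from stage.customer_account stg
         left join dv.h_customer hc on stg.customer_bk = hc.customer_bk
         left join dv.h_account  ha on stg.account_bk  = ha.account_bk;
quit;

data work.new_link_rows work.link_exceptions;
    set work.stg_link;
    if _n_ = 1 then do;
        declare hash l(dataset: 'dv.l_customer_account');   /* existing links */
        l.defineKey('customer_sk', 'account_sk');
        l.defineDone();
    end;
    if missing(customer_sk) or missing(account_sk) then
        output work.link_exceptions;                /* failed hub look-up        */
    else if l.find() ne 0 then do;
        _seq + 1;
        customer_account_sk = &max_lsk. + _seq;     /* new link surrogate key    */
        rc = l.add();                               /* avoid in-batch duplicates */
        output work.new_link_rows;
    end;
    drop _seq rc;
run;

proc append base=dv.l_customer_account data=work.new_link_rows force;
run;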

 

Loading a Link Satellite




For completeness, let me show this case as well, since it differs a bit from loading a satellite connected to a hub. First you need to get the contributing hubs' SKs by using their respective business keys. Then, by combining them, you will be able to look up the link SK.

Then, as for a hub satellite, use the SCD Type 2 Loader to handle versioning of the satellite attributes.

Of course you could/should have exception management in this job as well, but I omitted it to focus on the core components of the job.
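
As before, here is a hedged sketch of that look-up chain in code, resolving both hub SKs and then the link SK; the stage table and the relationship_type attribute are illustrative assumptions. The result would then feed the same SCD Type 2 pattern shown for the hub satellite, keyed on the link SK:

proc sql;
    create table work.stg_link_sat as
    select l.customer_account_sk,
           stg.relationship_type,
           datetime() as valid_from_dttm format=datetime20.
    from stage.customer_account_attributes stg
         inner join dv.h_customer hc on stg.customer_bk = hc.customer_bk
         inner join dv.h_account  ha on stg.account_bk  = ha.account_bk
         inner join dv.l_customer_account l
            on  l.customer_sk = hc.customer_sk
            and l.account_sk  = ha.account_sk;
quit;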

 

Conclusion

SAS Data Integration Studio has no specific support for loading a Data Vault. But the Data Vault loading logic is simple, so Data Integration Studio standard transformations can be used as is. The three transformations SCD Type 1 Loader, SCD Type 2 Loader and Lookup are enough to load a standard Data Vault model. If you have large volumes of immutable transactional data (a topic not discussed here), the Table Loader transformation can be used for straight inserts/appends.
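
To illustrate that last remark, a straight append of immutable transactional rows needs nothing more than a minimal sketch like the one below (the table names are made up for the example):

/* no SCD handling: just append the already-keyed transactional rows */
proc append base=dv.s_account_transaction
            data=work.stg_transactions_keyed
            force;
run;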
