SAS DI
Darshika Srivastava
Associate Project Manager @ HuQuo | MBA,Amity Business School
Using SAS DI Studio To Load A Data Vaul Posted 10-22-2014 04:25 AM | by LinusH (8509 views) Introduction Data Vault has over the last years becoming popular as modelling technique for data warehouse. My experience is that it has not been used widely (yet) within SAS implementations. So my questions are: Could you use SAS Data Integration for this? Is it recommended? This article is targeted to audiences with little/no knowledge of Data Vault, nor SAS Data Integrations Studio. So bare with me if you feel that some parts seem abundantly clear. What is SAS Data Integration Studio? SAS Data Integration Studio is an ETL tool offered by SAS Institute, and is a part of their Data Management portfolio. It lets you build and maintain metadata for databases, entities and jobs. It contains a set of standard transformations that help you with copy, map, transform and load your data. The metadata is the basis for deploying job code, which is in SAS Language. The SAS Language is a combination of: ANSI SQL with SAS extensions Vendor RDBMS specific SQL Data step language Extensive function library Procedures for standard data and statistical tasks Macro language to create parametrized and reusable code What is Data Vault? Data Vault is a data modelling paradigm, first presented by Dan Linstedt. It is aimed for the atomic layer of a data warehouse. And it was developed to answer some of the problems that modelling a data warehouse using the Third normal form (3NF). The simplest description on what Data Vault is, is to describe the three core types of entities: Hub: identifies a business concept. Contains only the business key for that concept (alongside with standard technical attributes such as Load Time, Source System Code and of course a surrogate key (SK)). No versioning – no valid from/to dates. Link: establishes relationships between hubs. Always describes M-M, regardless of the nature of current real world relationship. No versioning. Satellite: contains descriptive data; codes, names, numerical values, business dates etc. This data is always versioned. A Satellite can only have a relationship to one mother table: a hub or a link (less common). There is also an expressed best practice that you can have two persistent layers in a data vault warehouse: Core Data Vault: Reads all data, stores all data. No transformation other than maintaining business keys, and mapping to common column names and data types. Considered as the “fact” of the data warehouse Business Data Vault: also an atomic layer, but is the result after applying common business and transformation rules. These could include calculations, common code structures, data quality initiatives, customer segmentation and other statistical models. Why Use Data Vault? There are several benefits of using the Data Vault (DV) paradigm over 3NF. Some of them are (not limited to): Flexible: when adding new data sources to your DW, usually no need to redesign existing tables – just add new Agile: the flexibility lets you build small portions in each sprint. Extensions to the model usually means separate Hubs and Links. Predictable loading pattern. 3NF DW usually needs many waves to load, so it can maintain referential integrity. As an example, the SAS model for the Insurance industry needs 23 waves. A Data Vault only needs 3! How can that be - there is no exact corresponding version of the SAS Insurance industry model? Well because that a Data Vault you can always load the data in these three waves: Hubs – create hub surrogate keys based on business keys Links (need hub surrogate keys as FK, and potentially to build a link surrogate key), and Hub Satellites (need the Hub PK which will form the Satellite PK with a load timestamp or a valid from date/date time) Link Satellites: need the link surrogate key, which will form the PK together with a date/date time stamp. If a cardinality of a relationship changes, do nothing. The links are designed to handle 1-M, M-1 and M-M. No need to tweak referential integrity by not promoting the whole PK as FK (usually the valid from date/date time is not promoted) Data Vault and ETL What I will refer to in this document is the loading part of a DW. Extract will probably be quite similar to an ordinary DW. For transformation, there is not much of it in a Core Data Vault, but a bit more in a Business Data Vault. As you might have guessed from the what is described above, these are the main activities when loading a Data Vault: Create surrogate keys For hubs, based on a business key For links, based on the surrogate keys (SK) from the participating hubs Look-up surrogate keys For links, the hub SKs For hub satellites, the hub SK For link satellites, the hub SK, then the link SK Versioning, create generations of records when attributes change. This only done for satellites. Sample Data Vault Model This is a sample model that we will use for our example DI Studio load jobs. The different entities are in the de facto colouring standard. This is just a small part of the whole model, and just a subset of attributes are visible. So we can focus on principles, rather than details. Customer and Account are core business concepts, and represented as hubs. They are mandatory. The link represent a relation between an account and a customer. But a customer can have multiple account, and a an account can have (model wise) multiple customers connected to it. The satellites is used to store attributes that describes the core business concepts, or the relationship (links). How Do Can We Use SAS DI Studio for Loading A Data Vault? Loading a Hub We map 1-1 from a staging table that contain the business key for the business concept Customer. For private customers this is typically some kind of SSN, if available. The technical column Source System Code is being mapped here, but it could be done in the stage table if that seems more appropriate. The surrogate key and Load date time columns are generated (out-of-the box) in the SCD Type 1 transformation. As an alternative, you could use the Surrogate Key Generator transformation. But it generates an output of all records (including existing), which you need to take care of separately in a succeeding Table Loader step. Surrogate Key Generator is a so called “Generated Transform” which means that you could customize it as an ETL developer by using SAS Language. Loading a Satellite Here’s an example of loading a satellite table for Customer. The first step is to get the hub SK by using the business key in a Look-up. The SCD Type 2 Loader transformation handles versioning of data. You should not generate new SK in the satellite. This is because you should never have a Satellite referencing another table directly. Use the Customer hub SK as a business key in the SCD Type 2 transformation. Loading a Link Loading a link needs typically two steps. First look-up the SK’s for the contributing hubs, by using their business keys. Then use the SCD Type 1 Loader to manage creation of new link records, and the assigning them a link SK if needed. Why does the link need a SK? If you need to add some attributes to a link, they will be stored in satellite connected to the link. Then the Link SK is needed as FK in that satellite. The separate Table Loader will capture those records with where there is no match in the key Lookup. You can chose in the Lookup how to handle these exceptions (move record to exceptions table, abort job etc). Loading a Link Satellite For completeness, let me show me this case as well, since it differs a bit from loading a satellite connected to a hub. First you need to get the contributing hubs SK’s, by using respective business key. Then, by combining them you will be able to look-up the link SK. And then, like a hub satellite, use SCD Type 2 Loader to handle versioning of the satellite attributes. Of course you could/should have exception management in this job as well, but I omitted it to focus on the core components of the job. Conclusion SAS Data Integration Studio has no specific support for loading a Data Vault. But the Data Vault loading logic is simple, so Data Integration Studio standard transformations can be used as is. The three transformations SCD Type 1, SCD Type 2 and the Look-up transformations is enough to load a standard Data Vault model. If you have large volumes of non-changeable transactional data (topic not discussed here), the Table Loader transformation can be used for straight inserts/appends.t