Data Warehouse with a single pipeline in Snowflake
In the previous article, I introduced a single ELT pipeline in Snowflake that ingests thousands of files into thousands of tables asynchronously & simultaneously. We can apply the same method to the transform layers for dimensions & facts in the Data Warehouse. You just need to change the source from the external stage to a set of views that contain all of the business logic needed to transform the data for your Data Warehouse. This transform layer consists of nothing but views, and each view must match the schema of its target dimension or fact table.
Transform Layer
Transform Views between the Raw database and the Data Warehouse need to be implemented with the following rules (a sketch follows the list).
- Column names in a view should match the columns of the target dimension or fact table.
- The business key should be carried into the target dimension or fact table with a unique constraint, in addition to its own surrogate primary key.
- A last-updated column should be exposed in the view so that only rows inserted or changed since the last run are picked up. Its value should be a COALESCE of the latest timestamps across the tables participating in the view, so that a row is picked up if any of its values changes.
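Here is a minimal sketch of such a Transform View under those rules. All object and column names (RAW.SALES.CUSTOMER, TRANSFORM.VW_DIM_CUSTOMER, DW.DIM_CUSTOMER, etc.) are hypothetical placeholders, not objects from the article.

```sql
-- Hypothetical Transform View feeding DW.DIM_CUSTOMER.
-- Column names match the target dimension; CUSTOMER_BK is the business key;
-- LAST_UPDATED lets the pipeline filter to rows changed since the last run.
CREATE OR REPLACE VIEW TRANSFORM.VW_DIM_CUSTOMER AS
SELECT
    c.customer_id                           AS customer_bk,       -- business key
    INITCAP(c.customer_name)                AS customer_name,
    COALESCE(s.segment_name, 'UNKNOWN')     AS customer_segment,
    COALESCE(
        GREATEST(c.load_timestamp, s.load_timestamp),   -- NULL if either side is NULL
        c.load_timestamp,
        s.load_timestamp
    )                                       AS last_updated       -- latest change across sources
FROM RAW.SALES.CUSTOMER c
LEFT JOIN RAW.SALES.SEGMENT s
       ON s.segment_id = c.segment_id;
```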
Data Mart Views don't need to follow these rules, but it is good practice to match them to your target structures in the Analytics & Reporting area.
Single data pipeline for Staging, SCD Type 1 & 2 and Fact
Each single data pipeline consists of two stored procedures: one is a SQL wrapper and the other is the main stored procedure that calls the SQL statements in parallel. You can see how to implement a single data pipeline for Staging at https://www.dhirubhai.net/pulse/single-elt-pipeline-snowflake-kyoung-shin/.
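As a rough illustration of that wrapper/driver split, the Snowflake Scripting sketch below assumes a hypothetical mapping table (ETL.PIPELINE_MAP) that pairs each Transform View with its target table. The asynchronous fan-out itself is the subject of the linked article; this driver simply walks the mappings and delegates each one to the wrapper.

```sql
-- Hypothetical SQL wrapper: loads one target table from one Transform View.
CREATE OR REPLACE PROCEDURE ETL.SP_LOAD_TABLE(SRC_VIEW STRING, TGT_TABLE STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    stmt STRING;
BEGIN
    -- The real MERGE / SCD logic would go here; a plain INSERT keeps the sketch short.
    stmt := 'INSERT INTO ' || TGT_TABLE || ' SELECT * FROM ' || SRC_VIEW;
    EXECUTE IMMEDIATE :stmt;
    RETURN 'Loaded ' || TGT_TABLE;
END;
$$;

-- Hypothetical main procedure: walks the view-to-table mappings and calls the wrapper.
CREATE OR REPLACE PROCEDURE ETL.SP_RUN_TRANSFORM_LAYER()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    v_view  STRING;
    v_table STRING;
    c1 CURSOR FOR SELECT src_view, tgt_table FROM ETL.PIPELINE_MAP;
BEGIN
    FOR rec IN c1 DO
        v_view  := rec.src_view;
        v_table := rec.tgt_table;
        CALL ETL.SP_LOAD_TABLE(:v_view, :v_table);
    END FOR;
    RETURN 'All transform loads completed';
END;
$$;
```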
For SCD Type 1 & 2, you may consider implementing a hash via MD5, SHA1, or SHA2 along with a Hub, Link & Satellite approach (mimicking SCD Type 6) from Data Vault in order to support back-dated dimensions & facts. You can tweak the MERGE statement in the single data pipeline to apply your custom logic for it. The goal is to completely decouple your business logic from the mechanics of the data pipeline, so that it becomes much easier to maintain a large number of objects.
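As one example of that kind of tweak, the MERGE sketch below uses an MD5 hash of the tracked attributes to detect changes. The object and column names (DW.DIM_CUSTOMER, TRANSFORM.VW_DIM_CUSTOMER, CUSTOMER_BK, ROW_HASH) are hypothetical, and it shows a simple Type 1 overwrite; a Type 2 load would instead expire the current row and insert a new version rather than updating in place.

```sql
-- Hypothetical hash-based MERGE (SCD Type 1 shown for brevity).
MERGE INTO DW.DIM_CUSTOMER tgt
USING (
    SELECT customer_bk,
           customer_name,
           customer_segment,
           last_updated,
           MD5(CONCAT_WS('|', customer_name, customer_segment)) AS row_hash
    FROM TRANSFORM.VW_DIM_CUSTOMER
) src
    ON tgt.customer_bk = src.customer_bk
WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN UPDATE SET
    tgt.customer_name    = src.customer_name,
    tgt.customer_segment = src.customer_segment,
    tgt.row_hash         = src.row_hash,
    tgt.last_updated     = src.last_updated
WHEN NOT MATCHED THEN INSERT
    (customer_bk, customer_name, customer_segment, row_hash, last_updated)
VALUES
    (src.customer_bk, src.customer_name, src.customer_segment, src.row_hash, src.last_updated);
```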
Data Warehouse Architecture, everything within Snowflake!
Now, we can ingest all of the files from Staging into Raw tables simultaneously. Then we can run the transformations from Raw tables into Dimension & Fact tables simultaneously. (Referential constraints are not enforced in Snowflake.) It is true parallelism, thanks to Snowflake's decoupling of compute from storage.
With this single data pipeline approach, you can completely decouple the landing area (Data Lake) of files from your data pipeline (although your tables are still schema-bound to your file structure if you're using a structured data format), and you can completely decouple your business logic from your data pipeline with Transform Views, without using any third-party scripting language outside of Snowflake. And finally, you can run the entire process in parallel!
Data Lineage
Snowflake provides a COMMENT command to attach free text to any object, so you can use it to manage data lineage. The challenge is how to build a graphical lineage from the complex business logic inside Transform Views. My goal is to build the Data Warehouse entirely within Snowflake, and hopefully I can come up with code to present the complex lineage within Transform Views.
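For instance, a comment on a Transform View could record its upstream sources and downstream target (the tag format below is just a made-up convention), and the text can later be read back from the views' metadata.

```sql
-- Hypothetical lineage tag stored in the view's comment.
COMMENT ON VIEW TRANSFORM.VW_DIM_CUSTOMER IS
    'sources=RAW.SALES.CUSTOMER,RAW.SALES.SEGMENT; target=DW.DIM_CUSTOMER';

-- Read the tags back from metadata to assemble a lineage picture.
SELECT table_schema, table_name, comment
FROM INFORMATION_SCHEMA.VIEWS
WHERE table_schema = 'TRANSFORM'
  AND comment IS NOT NULL;
```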
Conclusion
Make it simple, repeatable, and manageable, with high performance. This is what I am trying to build as a best practice for ELT in Snowflake. Hopefully you find some value in this article.
Thanks,
Kyoung Shin
Chief Data Officer at NTERSOL