Data Warehouse with a single pipeline in Snowflake
In the previous article, I introduced a single ELT pipeline in Snowflake that ingests thousands of files into thousands of tables asynchronously & simultaneously. We can apply the same method to the transform layers for dimensions & facts in the Data Warehouse. You just need to change the source from the external stage to a set of views that contain all of the business logic needed to transform the data for your Data Warehouse. This transform layer consists of nothing but views, and each view must match the schema of its target dimension or fact table.
Transform Layer
Transform Views between the Raw database and the Data Warehouse need to be implemented with the following rules (a sketch follows the list).
- Column names in a view should match the columns of the target dimension or fact table.
- The business key should be carried into the target dimension or fact table with a unique constraint, in addition to its own surrogate primary key.
- A last-updated column should be exposed in the view so that only rows inserted or changed since the last run are picked up. Its value should be a COALESCE of the latest timestamps across the tables participating in the view, so that a row is picked up if any of its values changes.
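Here is a minimal sketch of such a Transform View under those rules. All object and column names (RAW.SALES.CUSTOMER, TRANSFORM.VW_DIM_CUSTOMER, DW.DIM_CUSTOMER, etc.) are hypothetical placeholders, not objects from the article.

```sql
-- Hypothetical Transform View feeding DW.DIM_CUSTOMER.
-- Column names match the target dimension; CUSTOMER_BK is the business key;
-- LAST_UPDATED lets the pipeline filter to rows changed since the last run.
CREATE OR REPLACE VIEW TRANSFORM.VW_DIM_CUSTOMER AS
SELECT
    c.customer_id                           AS customer_bk,       -- business key
    INITCAP(c.customer_name)                AS customer_name,
    COALESCE(s.segment_name, 'UNKNOWN')     AS customer_segment,
    COALESCE(
        GREATEST(c.load_timestamp, s.load_timestamp),   -- NULL if either side is NULL
        c.load_timestamp,
        s.load_timestamp
    )                                       AS last_updated       -- latest change across sources
FROM RAW.SALES.CUSTOMER c
LEFT JOIN RAW.SALES.SEGMENT s
       ON s.segment_id = c.segment_id;
```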
Data Mart Views don't need to follow these rules, but it is good practice to match them to your target structures in the Analytics & Reporting area.
Single data pipeline for Staging, SCD Type 1 & 2 and Fact
Each single data pipeline consists of two stored procedures: one is a SQL wrapper and the other is the main stored procedure that calls the SQL statements in parallel. You can see how to implement a single data pipeline for Staging at https://www.dhirubhai.net/pulse/single-elt-pipeline-snowflake-kyoung-shin/.
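As a rough illustration of that wrapper/driver split, the Snowflake Scripting sketch below assumes a hypothetical mapping table (ETL.PIPELINE_MAP) that pairs each Transform View with its target table. The asynchronous fan-out itself is the subject of the linked article; this driver simply walks the mappings and delegates each one to the wrapper.

```sql
-- Hypothetical SQL wrapper: loads one target table from one Transform View.
CREATE OR REPLACE PROCEDURE ETL.SP_LOAD_TABLE(SRC_VIEW STRING, TGT_TABLE STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    stmt STRING;
BEGIN
    -- The real MERGE / SCD logic would go here; a plain INSERT keeps the sketch short.
    stmt := 'INSERT INTO ' || TGT_TABLE || ' SELECT * FROM ' || SRC_VIEW;
    EXECUTE IMMEDIATE :stmt;
    RETURN 'Loaded ' || TGT_TABLE;
END;
$$;

-- Hypothetical main procedure: walks the view-to-table mappings and calls the wrapper.
CREATE OR REPLACE PROCEDURE ETL.SP_RUN_TRANSFORM_LAYER()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    v_view  STRING;
    v_table STRING;
    c1 CURSOR FOR SELECT src_view, tgt_table FROM ETL.PIPELINE_MAP;
BEGIN
    FOR rec IN c1 DO
        v_view  := rec.src_view;
        v_table := rec.tgt_table;
        CALL ETL.SP_LOAD_TABLE(:v_view, :v_table);
    END FOR;
    RETURN 'All transform loads completed';
END;
$$;
```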
For SCD Type 1 & 2, you may consider implementing a hash via MD5, SHA1, or SHA2 along with a Hub, Link & Satellite approach (mimicking SCD Type 6) from Data Vault in order to support back-dated dimensions & facts. You can tweak the MERGE statement in the single data pipeline to apply your custom logic for it. The goal is to completely decouple your business logic from the mechanics of the data pipeline, so that it becomes much easier to maintain a large number of objects.
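As one example of that kind of tweak, the MERGE sketch below uses an MD5 hash of the tracked attributes to detect changes. The object and column names (DW.DIM_CUSTOMER, TRANSFORM.VW_DIM_CUSTOMER, CUSTOMER_BK, ROW_HASH) are hypothetical, and it shows a simple Type 1 overwrite; a Type 2 load would instead expire the current row and insert a new version rather than updating in place.

```sql
-- Hypothetical hash-based MERGE (SCD Type 1 shown for brevity).
MERGE INTO DW.DIM_CUSTOMER tgt
USING (
    SELECT customer_bk,
           customer_name,
           customer_segment,
           last_updated,
           MD5(CONCAT_WS('|', customer_name, customer_segment)) AS row_hash
    FROM TRANSFORM.VW_DIM_CUSTOMER
) src
    ON tgt.customer_bk = src.customer_bk
WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN UPDATE SET
    tgt.customer_name    = src.customer_name,
    tgt.customer_segment = src.customer_segment,
    tgt.row_hash         = src.row_hash,
    tgt.last_updated     = src.last_updated
WHEN NOT MATCHED THEN INSERT
    (customer_bk, customer_name, customer_segment, row_hash, last_updated)
VALUES
    (src.customer_bk, src.customer_name, src.customer_segment, src.row_hash, src.last_updated);
```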
Data Warehouse Architecture, everything within Snowflake!
Now, we can ingest all of the files from Staging into Raw tables simultaneously. Then we can run the transformations from Raw tables into Dimension & Fact tables simultaneously. (Referential constraints are not enforced in Snowflake.) It is true parallelism, thanks to Snowflake's decoupling of compute from storage.
With this single data pipeline approach, you can completely decouple the landing area (Data Lake) of files from your data pipeline (although your tables are still schema-bound to your file structure if you're using a structured data format), and you can completely decouple your business logic from your data pipeline with Transform Views, without using any third-party scripting language outside of Snowflake. And finally, you can run the entire process in parallel!
Data Lineage
Snowflake provides a COMMENT command to attach free text to any object, so you can use it to manage data lineage. The challenge is how to build a graphical lineage from the complex business logic inside Transform Views. My goal is to build the Data Warehouse entirely within Snowflake, and hopefully I can come up with code to present the complex lineage within Transform Views.
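For instance, a comment on a Transform View could record its upstream sources and downstream target (the tag format below is just a made-up convention), and the text can later be read back from the views' metadata.

```sql
-- Hypothetical lineage tag stored in the view's comment.
COMMENT ON VIEW TRANSFORM.VW_DIM_CUSTOMER IS
    'sources=RAW.SALES.CUSTOMER,RAW.SALES.SEGMENT; target=DW.DIM_CUSTOMER';

-- Read the tags back from metadata to assemble a lineage picture.
SELECT table_schema, table_name, comment
FROM INFORMATION_SCHEMA.VIEWS
WHERE table_schema = 'TRANSFORM'
  AND comment IS NOT NULL;
```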
Conclusion
Make it simple, repeatable, and manageable, with high performance. This is what I am trying to build as a best practice for ELT in Snowflake. Hopefully you find some value in this article.
Thanks,
Kyoung Shin
Chief Data Officer at NTERSOL