Data Warehouse with a single pipeline in Snowflake

In the previous article, I introduced a single ELT pipeline in Snowflake that ingests thousands of files into thousands of tables asynchronously and simultaneously. We can apply the same method to the transform layers for dimensions & facts in a Data Warehouse. You just need to change the source path from the external stage to the source views, which contain all of the business logic needed to transform data according to your requirements. This transform layer consists only of views, each of which must match the schema of its target dimension or fact table.

Transform Layer

Transform Views between the Raw database and the Data Warehouse need to be implemented with the following rules (a sketch follows this list).

  1. Column names in a view should match the target dimension or fact table.
  2. The business key should be expressed in the target dimension or fact table with a unique constraint, besides its own surrogate primary key.
  3. A last-updated column should be expressed in the view, so that only rows inserted since the last run are picked up. Its value should be the COALESCE of the latest timestamps across the participating tables in the view, so that a row is picked up whenever any of its values changes.
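
To make the rules concrete, here is a minimal sketch of such a view. The RAW.CUSTOMER and RAW.ADDRESS tables and the DIM_CUSTOMER target are hypothetical names for illustration only.

    -- Hypothetical transform view feeding DW.DIM_CUSTOMER.
    CREATE OR REPLACE VIEW RAW.V_DIM_CUSTOMER AS
    SELECT
        c.CUSTOMER_ID   AS CUSTOMER_BK,  -- rule 2: business key of the target
        c.CUSTOMER_NAME,                 -- rule 1: names match the target table
        a.CITY,
        -- rule 3: latest timestamp across the participating tables;
        -- COALESCE guards against NULLs from the outer join.
        COALESCE(GREATEST(c.LAST_UPDATED, a.LAST_UPDATED),
                 c.LAST_UPDATED, a.LAST_UPDATED) AS LAST_UPDATED
    FROM RAW.CUSTOMER c
    LEFT JOIN RAW.ADDRESS a
      ON a.CUSTOMER_ID = c.CUSTOMER_ID;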

Data Mart views don't need to follow these rules, but it is good practice to match your target structure in the Analytics & Reporting area.

Single data pipeline for Staging, SCD Type 1 & 2 and Fact

Each single data pipeline consists of two stored procedures: one is a SQL wrapper, and the other is the main stored procedure that calls the SQL statements in parallel. You can see how to implement a single data pipeline for Staging at https://www.dhirubhai.net/pulse/single-elt-pipeline-snowflake-kyoung-shin/.
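
As a rough sketch only (the full implementation is in the linked article), the SQL-wrapper procedure might look like the following; the procedure name, columns, and delta filter are illustrative assumptions.

    -- A minimal sketch of the SQL-wrapper procedure (illustrative only).
    -- It builds and runs the load statement for one view/table pair; the
    -- main procedure (not shown) spawns tasks that call this wrapper
    -- concurrently, one per target table.
    CREATE OR REPLACE PROCEDURE RUN_TRANSFORM(SRC_VIEW STRING, TGT_TABLE STRING)
    RETURNS STRING
    LANGUAGE SQL
    AS
    $$
    DECLARE
      STMT STRING;
    BEGIN
      -- Delta load: only rows newer than the target's high-water mark (rule 3).
      STMT := 'INSERT INTO ' || TGT_TABLE ||
              ' SELECT * FROM ' || SRC_VIEW ||
              ' WHERE LAST_UPDATED > (SELECT COALESCE(MAX(LAST_UPDATED),' ||
              ' ''1900-01-01''::TIMESTAMP) FROM ' || TGT_TABLE || ')';
      EXECUTE IMMEDIATE :STMT;
      RETURN 'Loaded: ' || TGT_TABLE;
    END;
    $$;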

For SCD Type 1 & 2, you may consider implementing a hash via MD5, SHA1, or SHA2 with a Hub, Link & Satellite approach (mimicking SCD Type 6) from Data Vault, in order to support back-dated dimensions & facts. You can tweak the MERGE statement in the single data pipeline to apply your own custom logic, as sketched below. The goal is to completely decouple your business logic from the mechanics of the data pipeline, so that a large number of objects is much easier to maintain.
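
For example, a Type 1 variant of the MERGE using an MD5 row hash might look like the sketch below (all table, view, and column names are hypothetical); a Type 2 variant would instead expire the current row and insert a new version.

    -- A minimal Type 1 MERGE sketch: an MD5 hash over the tracked columns
    -- detects changes; COALESCE keeps NULLs from nulling the concatenation.
    MERGE INTO DW.DIM_CUSTOMER t
    USING RAW.V_DIM_CUSTOMER s
      ON t.CUSTOMER_BK = s.CUSTOMER_BK   -- business key (rule 2)
    WHEN MATCHED AND t.ROW_HASH <>
         MD5(COALESCE(s.CUSTOMER_NAME, '') || '|' || COALESCE(s.CITY, '')) THEN
      UPDATE SET
        CUSTOMER_NAME = s.CUSTOMER_NAME,
        CITY          = s.CITY,
        ROW_HASH      = MD5(COALESCE(s.CUSTOMER_NAME, '') || '|' ||
                            COALESCE(s.CITY, '')),
        LAST_UPDATED  = s.LAST_UPDATED
    WHEN NOT MATCHED THEN
      INSERT (CUSTOMER_BK, CUSTOMER_NAME, CITY, ROW_HASH, LAST_UPDATED)
      VALUES (s.CUSTOMER_BK, s.CUSTOMER_NAME, s.CITY,
              MD5(COALESCE(s.CUSTOMER_NAME, '') || '|' || COALESCE(s.CITY, '')),
              s.LAST_UPDATED);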

Data Warehouse Architecture, everything within Snowflake!

Now we can ingest all of the files from Staging into Raw tables simultaneously, and then run the transformations from Raw tables into Dimension & Fact tables simultaneously (referential constraints are not enforced in Snowflake). It's true parallelism, thanks to Snowflake's decoupling of compute from storage.

With this single data pipeline approach, you completely decouple the landing area (Data Lake) of files from your data pipeline (although your tables are still schema-bound to your file structure if you're using a structured data format), and you completely decouple your business logic from your data pipeline with Transform Views, without using any third-party scripting language outside of Snowflake. And at last, you can run the entire process simultaneously!

Data Lineage

Snowflake provides a COMMENT command to tag any object with text, so you can use it to manage data lineage. The open issue is how to build a graphical lineage through the complex business logic in the Transform Views. My goal is to build the Data Warehouse entirely within Snowflake, and hopefully I can come up with code to present the complex lineage within the Transform Views.
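
For instance, one simple convention (an assumption of mine, not a built-in feature) is to record each view's upstream sources in its comment and read them back from the metadata.

    -- Hypothetical convention: record upstream sources in the view's comment.
    COMMENT ON VIEW RAW.V_DIM_CUSTOMER IS
      'lineage: RAW.CUSTOMER, RAW.ADDRESS -> DW.DIM_CUSTOMER';

    -- Read the comments back (the comment column of the output) and parse them.
    SHOW VIEWS IN SCHEMA RAW;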

Conclusion

Make it simple, repeatable, and manageable, with high performance. This is what I am trying to build as a best practice for ELT in Snowflake. Hopefully you find some value in this article.

Thanks,

Kyoung Shin

Chief Data Officer at NTERSOL
