Snowflake and ELVT vs [ELT|ETL] – Case Study Part 1, The “No Data Model” Data Architecture
This article is a case study of a real-world implementation. The client and purpose of the application are not relevant to the architecture.
My previous articles on Snowflake and ELVT, Snowflake and ELVT vs [ELT, ETL], Part 1 and Snowflake and ELVT vs [ELT, ETL], Part 2, The ELVT Reference Architecture, have been primarily theoretical, only alluding to implementations using this technique.
The client presented a unique challenge: create an architecture before the application and its data model were even defined or implemented!
The only known “requirements” are:
At this point, the reader might be saying to themselves, as I did initially, “No big deal, use a data integration product such as Fivetran”. This will handle both data and schema changes.
Except there is a 5th requirement: the solution must be FedRAMP Moderate certified! None of the currently available data integration offerings meet this requirement. As a result, Kafka and JSON form the data pipeline from Salesforce to Snowflake. The pipeline is repurposed from a prior implementation for Salesforce-to-PostgreSQL replication.
How to cope with all the above?
A Snowflake-based Salesforce “Object” DDL Generator! Each Salesforce Object results in three VIEWs, two TABLEs, one STREAM, and one TASK in Snowflake for capturing the Salesforce data for reporting and analytics.
The Object Generator consists of meta-data TABLEs and UDFs for generating the DDL for the physical TABLEs, STREAMs, TASKs, and VIEWs corresponding to Salesforce Objects.
Meta-data Tables
A meta-data table, SFORCE_META_DATA, consisting of meta-data extracted from Salesforce with the following columns:
A meta-data table, SFORCE_JSON_CASTING, for casting the Salesforce data type to Snowflake with the following columns:
A Calendar dimension table, with a DATE column
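The article does not show the calendar table’s DDL. A minimal sketch of one way to build it, assuming the table is named DIM_CALENDAR_DATE (as referenced later) and that the start date and row count are arbitrary:
-- Hypothetical sketch: one row per day in a simple calendar dimension.
CREATE TABLE IF NOT EXISTS DIM_CALENDAR_DATE AS
SELECT DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1, '2015-01-01'::DATE) AS "DATE"
FROM TABLE(GENERATOR(ROWCOUNT => 10000));  -- roughly 27 years of dates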
Physical Model
Every Salesforce Object, e.g., ACCOUNT and CONTACT, will have the following corresponding physical Snowflake objects:
A staging TABLE for the Kafka payload, e.g., ACCOUNT_KAFKA_STG, with the two standard VARIANT/JSON columns from the Kafka Connector:
An "append only" STREAM, e.g., ACCOUNT_KAFKA_STREAM on the staging table, e.g., ACCOUNT_KAFKA_STG
A "history" TABLE, e.g., ACCOUNT, holding all data from Salesforce with the following columns:
A TASK, e.g., ACCOUNT_INSERT_INTO_TASK – performs the ELT from the STREAM, removing unwanted JSON objects and filling the DTIME columns.
UDFs are created to generate the DDL for each of the physical objects. Each of the UDFs takes the desired object name, e.g., ‘ACCOUNT’ and returns the DDL for creating the appropriate object.
The “standardized” UDFs:
CREATE TABLE IF NOT EXISTS
SFORCE_PHYSICAL_SCHEMA.ACCOUNT
(LAST_MODIFIED_DTIME TIMESTAMP_TZ(9),
CREATED_DTIME TIMESTAMP_TZ(9),
RECORD_CONTENT VARIANT)
DATA_RETENTION_TIME_IN_DAYS = 90;
CREATE TABLE IF NOT EXISTS
SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
(RECORD_METADATA VARIANT,
RECORD_CONTENT VARIANT)
DATA_RETENTION_TIME_IN_DAYS = 30;
CREATE STREAM IF NOT EXISTS
SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM ON TABLE
SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
APPEND_ONLY = TRUE SHOW_INITIAL_ROWS = FALSE;
CREATE OR REPLACE TASK SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK
SCHEDULE = '1440 MINUTE'
ALLOW_OVERLAPPING_EXECUTION = FALSE
WAREHOUSE = SFORCE_TASK_USAGE_WH
USER_TASK_TIMEOUT_MS = 300000
WHEN SYSTEM$STREAM_HAS_DATA('SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM')
AS
INSERT INTO SFORCE_PHYSICAL_SCHEMA.ACCOUNT
(SELECT
convert_timezone('America/Los_Angeles', record_content:LastModifiedDate::timestamp) AS
LAST_MODIFIED_DTIME,
convert_timezone('America/Los_Angeles', record_content:CreatedDate::timestamp) AS
CREATED_DTIME,
record_content AS RECORD_CONTENT  -- unwanted JSON objects are pruned here (details omitted)
FROM
SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM
ORDER BY
LAST_MODIFIED_DTIME);
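The article does not show the generator UDFs’ bodies. As a rough illustration only, a standardized generator such as UDF_GEN_SFORCE_KAFKA_STG could be a simple SQL UDF that splices the object name into a DDL template; in the actual implementation the standardized UDFs delegate to overloaded variants, as described next, but this sketch inlines the template for brevity:
-- Hypothetical sketch of a standardized generator UDF.
CREATE OR REPLACE FUNCTION UDF_GEN_SFORCE_KAFKA_STG(OBJECT_NAME VARCHAR)
RETURNS VARCHAR
AS
$$
    'CREATE TABLE IF NOT EXISTS SFORCE_PHYSICAL_SCHEMA.' || OBJECT_NAME || '_KAFKA_STG ' ||
    '(RECORD_METADATA VARIANT, RECORD_CONTENT VARIANT) ' ||
    'DATA_RETENTION_TIME_IN_DAYS = 30;'
$$;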
There are corresponding overloaded UDFs with additional parameters that govern the options used in the DDL, such as the target schema and the CREATE OR REPLACE / IF NOT EXISTS options. The “standardized” UDFs call these functions with standard settings.
This means that all generated objects have the same structure; only the names differ. The deployment schema is SFORCE_PHYSICAL_SCHEMA. Changes to a Salesforce Object do not require ALTERing any physical objects.
This approach facilitates the development and deployment process. Eventually, an automated pipeline will create the physical objects when new Salesforce Objects are added. Until then, we can substantially automate the process by loading the SFORCE_META_DATA table from a CSV file.
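A minimal sketch of such a load, assuming a hypothetical internal stage named @SFORCE_META_STAGE and a single CSV extract file:
-- Hypothetical: load the Salesforce meta-data extract from a staged CSV file.
COPY INTO SFORCE_META_DATA
FROM @SFORCE_META_STAGE/sforce_meta_data.csv
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');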
This allows us to create simple SQL to produce scripts, e.g.
SELECT
UDF_GEN_SFORCE_OBJECT_TABLE('ACCOUNT')
UNION ALL
SELECT UDF_GEN_SFORCE_KAFKA_STG('ACCOUNT')
UNION ALL
SELECT UDF_GEN_SFORCE_KAFKA_STREAM('ACCOUNT');
We combine the UDF calls into a single UDF that creates the DDL for deploying all of an Object’s physical objects:
UDF_DEPLOY_SFORCE_PHYSICAL(object_name VARCHAR)
We can then generate a script for deploying either the entire set of Salesforce Objects or a subset with a simple SQL statement, e.g.:
SELECT
UDF_DEPLOY_SFORCE_PHYSICAL(SFORCE_OBJECT)
FROM
( SELECT
DISTINCT SFORCE_OBJECT
FROM
SFORCE_META_DATA);
Presentation VIEW Model
One of the “best practices” with JSON is to use VIEWs for presentation. JSON fields are shredded into relational columns only as needed, typically for performance reasons. A common use case is converting a JSON UTC column to a desired time zone. In this application, the LAST_MODIFIED_DTIME and CREATED_DTIME are particularly important and need to be in Pacific Time.
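As a hedged illustration (the field and column names are taken from the later examples), a presentation-VIEW column that shreds a JSON UTC timestamp and converts it to Pacific Time might look like:
-- Fragment of a presentation VIEW column list: shred the JSON field,
-- then convert from UTC to Pacific Time.
CONVERT_TIMEZONE('UTC', 'America/Los_Angeles',
                 RECORD_CONTENT:LastModifiedDate::TIMESTAMP_NTZ) AS LAST_MODIFIED_DTIME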
As the number and structure of the Salesforce Objects is both unknown and certain to be subject to frequent changes, a simple Snowflake VIEW generator is the key part of this architecture.
The application is required to keep a history of every change to every record in every Salesforce Object. This presents challenges common to a temporal database. In this application, most metrics are counts and date ranges instead of numeric values, but the same techniques are applicable.
As the rate of change differs between Salesforce Objects, there is no reliable join between objects for historical data. A “normalization” technique is needed to enable consistent joins between Snowflake objects.
The solution is a three-layer stack of VIEWs:
1. The “history” VIEW, with the suffix “_HIST”, e.g., VW_ACCOUNT_HIST. This is the presentation VIEW over the history TABLE, shredding the JSON into relational columns.
2. The daily “as of date” snapshot VIEW, which joins the _HIST VIEW with DIM_CALENDAR_DATE, selecting for each date the latest record satisfying
WHERE MAX(LAST_MODIFIED_DTIME) <= DIM_CALENDAR_DATE.DATE
The suffix is “_AS_OF_DATE”, e.g., VW_ACCOUNT_AS_OF_DATE (a sketch of this VIEW appears after this list). This allows meaningful JOINs between objects based on the appropriate IDs and the AS_OF_DATE column, e.g.,
JOIN ON VW_CONTACT_AS_OF_DATE.ACCOUNT_ID = VW_ACCOUNT_AS_OF_DATE.ACCOUNT_ID
AND
VW_CONTACT_AS_OF_DATE.AS_OF_DATE = VW_ACCOUNT_AS_OF_DATE.AS_OF_DATE
PERFORMANCE NOTE: The above is the “logical” join condition. Because ACCOUNT_ID is an 18-character string, and Snowflake (like most databases) performs poorly when joining on long character strings, the actual joins use HASH(AS_OF_DATE, <name>_ID) as the join keys.
3. The “current” record status, with the suffix “_CURR”. This is simply:
SELECT… FROM… VW_object_AS_OF_DATE WHERE AS_OF_DATE = CURRENT_DATE()
Joins only require appropriate ID columns, as seen below.
The “as of date” and “current” VIEWs are the primary VIEWs for the anticipated analytics and KPIs.
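The article does not show the generated _AS_OF_DATE SQL. The following is a minimal sketch of one way to express the “latest record per ID as of each calendar date” logic; the column list is illustrative and abbreviated:
-- Hypothetical sketch of the second-layer VIEW: one row per ACCOUNT_ID per calendar date,
-- carrying the most recent history record on or before that date.
CREATE OR REPLACE VIEW VW_ACCOUNT_AS_OF_DATE AS
SELECT
    c."DATE" AS AS_OF_DATE,
    h.ACCOUNT_ID,
    h."Account Name",
    h.LAST_MODIFIED_DTIME
FROM DIM_CALENDAR_DATE c
JOIN VW_ACCOUNT_HIST h
    ON TO_DATE(h.LAST_MODIFIED_DTIME) <= c."DATE"
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY c."DATE", h.ACCOUNT_ID
    ORDER BY h.LAST_MODIFIED_DTIME DESC) = 1;
With the hashed join keys from the performance note above, a join between two _AS_OF_DATE VIEWs then looks roughly like ON HASH(CNTC.AS_OF_DATE, CNTC.ACCOUNT_ID) = HASH(ACCT.AS_OF_DATE, ACCT.ACCOUNT_ID).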
Generating VIEWs
Let’s discuss the key UDFs used to generate the VIEWs.
UDF_GEN_VW_HIST(‘object_name’) generates the first level of VIEW DDL, the “history” _HIST VIEW. This is the “presentation”/SOPR VIEW as described in Snowflake and ELVT vs [ELT, ETL], Part 2, The ELVT Reference Architecture.
UDF_GEN_VW_HIST uses the SFORCE_META_DATA and SFORCE_JSON_CASTING tables to generate the CREATE VIEW DDL. It maps each JSON API field to a VIEW column based on the field label, casting each field according to the SFORCE_JSON_CASTING mapping.
UDF_GEN_VW_HIST is composed of calls to several functions. The key child function is UDF_GEN_TARGET_COLUMNS. This function creates the VIEW’s columns for both standard and custom fields. It uses the FIELD_LABEL as the VIEW’s column name in mixed case, except for the ID field and fields with data types of REFERENCE and LOOKUP; for these, the resulting column name is upper snake_case. The ID field will be named <object_name>_ID. A partial example for VW_CONTACT_HIST:
RECORD_CONTENT:Target_Payload__c.AccountId::STRING AS "ACCOUNT_ID" ,
RECORD_CONTENT:Target_Payload__c.Annual_Income__c::STRING AS "Annual Income" ,
RECORD_CONTENT:Target_Payload__c.Birthdate::DATE AS "Birthdate" ,
RECORD_CONTENT:Target_Payload__c.Id::STRING AS "CONTACT_ID" ,
RECORD_CONTENT:Target_Payload__c.CreatedById::STRING AS "CREATED_BY_ID" ,
RECORD_CONTENT:Target_Payload__c.FirstName::STRING AS "First Name"
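The article does not show UDF_GEN_TARGET_COLUMNS itself. As a rough sketch only, the column list could be assembled from the two meta-data tables with LISTAGG; the meta-data column names used here (FIELD_API_NAME, FIELD_TYPE, SNOWFLAKE_TYPE) are assumptions, and the special handling of ID, REFERENCE, and LOOKUP fields is omitted:
-- Hypothetical sketch: assemble the VIEW column list from the meta-data tables.
-- FIELD_API_NAME, FIELD_TYPE, and SNOWFLAKE_TYPE are assumed column names.
CREATE OR REPLACE FUNCTION UDF_GEN_TARGET_COLUMNS(OBJECT_NAME VARCHAR)
RETURNS VARCHAR
AS
$$
    SELECT LISTAGG('RECORD_CONTENT:Target_Payload__c.' || m.FIELD_API_NAME ||
                   '::' || c.SNOWFLAKE_TYPE || ' AS "' || m.FIELD_LABEL || '"', ' , ')
    FROM SFORCE_META_DATA m
    JOIN SFORCE_JSON_CASTING c
        ON m.FIELD_TYPE = c.FIELD_TYPE
    WHERE m.SFORCE_OBJECT = OBJECT_NAME
$$;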
UDF_GEN_VW_AS_OF_DATE generates the second level VIEW DDL, joining the _HIST VIEW with DIM_CALENDAR_DATE as noted above.
UDF_GEN_VW_CURR generates the third level VIEW’s DDL. It is simply
SELECT <columns> FROM VW_object_AS_OF_DATE WHERE AS_OF_DATE = CURRENT_DATE
All the generated VIEWs use CREATE OR REPLACE syntax. This allows simple updates to the VIEWs based on changes to the SFORCE_META_DATA table. NOTE: due to Snowflake’s compiling of VIEWs, all three VIEWs need to be recreated in order, _HIST first, then _AS_OF_DATE, then _CURR, when the meta-data for the Salesforce Object changes, as both _AS_OF_DATE and _CURR SELECT from the lower-level VIEW in their DDL, which is resolved at CREATE VIEW execution with text substitution.
UDF_DEPLOY_SFORCE_VIEWS(‘object_name’) generates a DDL script for creating the three VIEWs.
The same simple SQL technique as noted above for the physical objects can be used to generate a script to create or refresh multiple objects.
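Mirroring the earlier physical-object script, a script to create or refresh the VIEWs for every Object in the meta-data might be generated as follows:
SELECT
    UDF_DEPLOY_SFORCE_VIEWS(SFORCE_OBJECT)  -- one DDL script per Salesforce Object
FROM
    (SELECT DISTINCT SFORCE_OBJECT FROM SFORCE_META_DATA);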
Query Examples
Let’s examine the results for a Salesforce Account Object selected from the different VIEW types.
SELECT
"Account Name",
ACCOUNT_ID,
last_modified_dtime,
TO_DATE(LAST_MODIFIED_DTIME)
FROM
VW_ACCOUNT_HIST
WHERE
ACCOUNT_ID = '0013R000003pNu8QAE'
ORDER BY
to_Date(last_modified_dtime);
returns:
Note that the account had several updates on two different dates, 10/22 and 10/29.
Let’s see the data for the daily snapshot from VW_ACCOUNT_AS_OF_DATE:
Note that the results remain the same from the last update on 10/22 through 10/28 and change on 10/29.
Let’s look at the current status, VW_ACCOUNT_CURR:
Note that the status has not changed since 10/29.
Finally, let’s see the current values for this Account and its CONTACTs:
SELECT
"Account Name",
"First Name",
"Birthdate",
"Driver's License State"
FROM
VW_ACCOUNT_CURR ACCT
JOIN
VW_CONTACT_CURR CNTC
ON
ACCT.ACCOUNT_ID = CNTC.ACCOUNT_ID
WHERE
ACCT.ACCOUNT_ID = '0013R000003pNu8QAE';
produces:
Summary
The architecture described in this article has numerous advantages:
Epilogue
The architecture and examples in this article were developed and tested months before the first Salesforce Objects were implemented. While the application is still in its initial stages, 22 standard and custom Salesforce Objects have been implemented, producing 161 Snowflake objects. No changes have been required to the architecture.
Copyright © 2021, Jeffrey Jacobs & Associates, LLC