Snowflake and ELVT vs [ELT|ETL] – Case Study Part 1, The “No Data Model” Data Architecture
This article is a case study of a real-world implementation. The client and purpose of the application are not relevant to the architecture.
My previous articles on Snowflake and ELVT, Snowflake and ELVT vs [ELT, ETL], Part 1 and Snowflake and ELVT vs [ELT, ETL], Part 2, The ELVT Reference Architecture, have been primarily theoretical, only alluding to implementations using this technique.
The client presented a unique challenge: create an architecture before the application and its data model were even defined or implemented!
The only known “requirements” are:
At this point, the reader might be saying to themselves, as I did initially, “No big deal, use a data integration product such as Fivetran”. This will handle both data and schema changes.
Except there is a 5th requirement: the solution must be FedRAMP Moderate certified! None of the currently available data integration offerings meet this requirement. As a result, Kafka and JSON form the data pipeline from Salesforce to Snowflake. The pipeline is repurposed from a prior implementation for Salesforce-to-PostgreSQL replication.
How to cope with all the above?
A Snowflake-based Salesforce “Object” DDL Generator! Each Salesforce Object results in three VIEWs, two TABLEs, one STREAM, and one TASK in Snowflake for capturing the Salesforce data for reporting and analytics.
The Object Generator consists of meta-data TABLEs and UDFs for generating the DDL for the physical TABLEs, STREAMs, TASKs, and VIEWs corresponding to Salesforce Objects.
Meta-data Tables
A meta-data table, SFORCE_META_DATA, consisting of meta-data extracted from Salesforce with the following columns:
A meta-data table, SFORCE_JSON_CASTING, for casting the Salesforce data type to Snowflake with the following columns:
A Calendar dimension table, with a DATE column
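The article does not show the calendar table’s DDL. A minimal sketch of one way to build it, assuming the table is named DIM_CALENDAR_DATE (as referenced later) and that the start date and row count are arbitrary:
-- Hypothetical sketch: one row per day in a simple calendar dimension.
CREATE TABLE IF NOT EXISTS DIM_CALENDAR_DATE AS
SELECT DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1, '2015-01-01'::DATE) AS "DATE"
FROM TABLE(GENERATOR(ROWCOUNT => 10000));  -- roughly 27 years of dates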
Physical Model
Every Salesforce Object, e.g., ACCOUNT and CONTACT, will have the following corresponding physical Snowflake objects:
A staging TABLE for the Kafka payload, e.g., ACCOUNT_KAFKA_STG, with the two standard VARIANT/JSON columns from the Kafka Connector:
An "append only" STREAM, e.g., ACCOUNT_KAFKA_STREAM on the staging table, e.g., ACCOUNT_KAFKA_STG
A "history" TABLE, e.g., ACCOUNT, holding all data from Salesforce with the following columns:
A TASK, e.g., ACCOUNT_INSERT_INTO_TASK – performs the ELT from the STREAM, removing unwanted JSON objects and filling the DTIME columns.
UDFs are created to generate the DDL for each of the physical objects. Each of the UDFs takes the desired object name, e.g., ‘ACCOUNT’ and returns the DDL for creating the appropriate object.
The “standardized” UDFs:
CREATE TABLE IF NOT EXISTS
SFORCE_PHYSICAL_SCHEMA.ACCOUNT
(LAST_MODIFIED_DTIME TIMESTAMP_TZ(9),
CREATED_DTIME TIMESTAMP_TZ(9),
RECORD_CONTENT VARIANT)
DATA_RETENTION_TIME_IN_DAYS = 90;
CREATE TABLE IF NOT EXISTS
SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
(RECORD_METADATA VARIANT,
RECORD_CONTENT VARIANT)
DATA_RETENTION_TIME_IN_DAYS = 30;
CREATE STREAM IF NOT EXISTS
SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM ON TABLE
SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
APPEND_ONLY = TRUE SHOW_INITIAL_ROWS = FALSE;
CREATE OR REPLACE TASK SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK
SCHEDULE = '1440 MINUTE'
ALLOW_OVERLAPPING_EXECUTION = FALSE
WAREHOUSE = SFORCE_TASK_USAGE_WH
USER_TASK_TIMEOUT_MS = 300000
WHEN SYSTEM$STREAM_HAS_DATA('SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM')
AS
INSERT INTO SFORCE_PHYSICAL_SCHEMA.ACCOUNT
(SELECT
convert_timezone('America/Los_Angeles', record_content:LastModifiedDate::timestamp) AS
LAST_MODIFIED_DTIME,
convert_timezone('America/Los_Angeles', record_content:CreatedDate::timestamp) AS
CREATED_DTIME,
record_content AS RECORD_CONTENT  -- unwanted JSON objects are pruned here (details omitted)
FROM
SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM
ORDER BY
LAST_MODIFIED_DTIME);
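The article does not show the generator UDFs’ bodies. As a rough illustration only, a standardized generator such as UDF_GEN_SFORCE_KAFKA_STG could be a simple SQL UDF that splices the object name into a DDL template; in the actual implementation the standardized UDFs delegate to overloaded variants, as described next, but this sketch inlines the template for brevity:
-- Hypothetical sketch of a standardized generator UDF.
CREATE OR REPLACE FUNCTION UDF_GEN_SFORCE_KAFKA_STG(OBJECT_NAME VARCHAR)
RETURNS VARCHAR
AS
$$
    'CREATE TABLE IF NOT EXISTS SFORCE_PHYSICAL_SCHEMA.' || OBJECT_NAME || '_KAFKA_STG ' ||
    '(RECORD_METADATA VARIANT, RECORD_CONTENT VARIANT) ' ||
    'DATA_RETENTION_TIME_IN_DAYS = 30;'
$$;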
There are corresponding overloaded UDFs with additional parameters that govern the options used in the DDL, such as the target schema and the CREATE OR REPLACE / IF NOT EXISTS options. The “standardized” UDFs call these functions with standard settings.
This means that all generated objects have the same structure; only the names differ. The deployment schema is SFORCE_PHYSICAL_SCHEMA. Changes to a Salesforce Object do not require ALTERing any physical objects.
This approach facilitates the development and deployment process. Eventually, an automated pipeline will create the physical objects when new Salesforce Objects are added. Until then, we can substantially automate the process by loading the SFORCE_META_DATA table from a CSV file.
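A minimal sketch of such a load, assuming a hypothetical internal stage named @SFORCE_META_STAGE and a single CSV extract file:
-- Hypothetical: load the Salesforce meta-data extract from a staged CSV file.
COPY INTO SFORCE_META_DATA
FROM @SFORCE_META_STAGE/sforce_meta_data.csv
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');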
This allows us to create simple SQL to produce scripts, e.g.
SELECT
UDF_GEN_SFORCE_OBJECT_TABLE('ACCOUNT')
UNION ALL
SELECT UDF_GEN_SFORCE_KAFKA_STG('ACCOUNT')
UNION ALL
SELECT UDF_GEN_SFORCE_KAFKA_STREAM('ACCOUNT');
We combine the UDF calls into a single UDF that creates the DDL for deploying all of an Object’s physical objects:
UDF_DEPLOY_SFORCE_PHYSICAL(object_name VARCHAR)
We can then generate a script for deploying either the entire set of Salesforce Objects or a subset with a simple SQL statement, e.g.:
SELECT
UDF_DEPLOY_SFORCE_PHYSICAL(SFORCE_OBJECT)
FROM
( SELECT
DISTINCT SFORCE_OBJECT
FROM
SFORCE_META_DATA);
Presentation VIEW Model
One of the “best practices” with JSON is to use VIEWs for presentation. JSON fields are shredded into relational columns only as needed, typically for performance reasons. A common use case is converting a JSON UTC column to a desired time zone. In this application, the LAST_MODIFIED_DTIME and CREATED_DTIME are particularly important and need to be in Pacific Time.
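As a hedged illustration (the field and column names are taken from the later examples), a presentation-VIEW column that shreds a JSON UTC timestamp and converts it to Pacific Time might look like:
-- Fragment of a presentation VIEW column list: shred the JSON field,
-- then convert from UTC to Pacific Time.
CONVERT_TIMEZONE('UTC', 'America/Los_Angeles',
                 RECORD_CONTENT:LastModifiedDate::TIMESTAMP_NTZ) AS LAST_MODIFIED_DTIME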
As the number and structure of the Salesforce Objects is both unknown and certain to be subject to frequent changes, a simple Snowflake VIEW generator is the key part of this architecture.
The application is required to keep a history of every change to every record in every Salesforce Object. This presents challenges common to a temporal database. In this application, most metrics are counts and date ranges instead of numeric values, but the same techniques are applicable.
As the rate of change differs between Salesforce Objects, there is no reliable join between objects for historical data. A “normalization” technique is needed to enable consistent joins between Snowflake objects.
The solution is a three-layer stack of VIEWs:
1. The “history” VIEW, with the suffix “_HIST”, e.g., VW_ACCOUNT_HIST. This is the presentation VIEW over the history TABLE, shredding the JSON into relational columns.
2. The daily “as of date” snapshot VIEW, which joins the _HIST VIEW with DIM_CALENDAR_DATE, selecting for each date the latest record satisfying
WHERE MAX(LAST_MODIFIED_DTIME) <= DIM_CALENDAR_DATE.DATE
The suffix is “_AS_OF_DATE”, e.g., VW_ACCOUNT_AS_OF_DATE (a sketch of this VIEW appears after this list). This allows meaningful JOINs between objects based on the appropriate IDs and the AS_OF_DATE column, e.g.,
JOIN ON VW_CONTACT_AS_OF_DATE.ACCOUNT_ID = VW_ACCOUNT_AS_OF_DATE.ACCOUNT_ID
AND
VW_CONTACT_AS_OF_DATE.AS_OF_DATE = VW_ACCOUNT_AS_OF_DATE.AS_OF_DATE
PERFORMANCE NOTE: The above is the “logical” join condition. Because ACCOUNT_ID is an 18-character string, and Snowflake (like most databases) performs poorly when joining on long character strings, the actual joins use HASH(AS_OF_DATE, <name>_ID) as the join keys.
3. The “current” record status, with the suffix “_CURR”. This is simply:
SELECT… FROM… VW_object_AS_OF_DATE WHERE AS_OF_DATE = CURRENT_DATE()
Joins only require appropriate ID columns, as seen below.
The “as of date” and “current” VIEWs are the primary VIEWs for the anticipated analytics and KPIs.
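The article does not show the generated _AS_OF_DATE SQL. The following is a minimal sketch of one way to express the “latest record per ID as of each calendar date” logic; the column list is illustrative and abbreviated:
-- Hypothetical sketch of the second-layer VIEW: one row per ACCOUNT_ID per calendar date,
-- carrying the most recent history record on or before that date.
CREATE OR REPLACE VIEW VW_ACCOUNT_AS_OF_DATE AS
SELECT
    c."DATE" AS AS_OF_DATE,
    h.ACCOUNT_ID,
    h."Account Name",
    h.LAST_MODIFIED_DTIME
FROM DIM_CALENDAR_DATE c
JOIN VW_ACCOUNT_HIST h
    ON TO_DATE(h.LAST_MODIFIED_DTIME) <= c."DATE"
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY c."DATE", h.ACCOUNT_ID
    ORDER BY h.LAST_MODIFIED_DTIME DESC) = 1;
With the hashed join keys from the performance note above, a join between two _AS_OF_DATE VIEWs then looks roughly like ON HASH(CNTC.AS_OF_DATE, CNTC.ACCOUNT_ID) = HASH(ACCT.AS_OF_DATE, ACCT.ACCOUNT_ID).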
Generating VIEWs
Let’s discuss the key UDFs used to generate the VIEWs.
UDF_GEN_VW_HIST(‘object_name’) generates the first level of VIEW DDL, the “history” _HIST VIEW. This is the “presentation”/SOPR VIEW as described in Snowflake and ELVT vs [ELT, ETL], Part 2, The ELVT Reference Architecture.
UDF_GEN_VW_HIST uses the SFORCE_META_DATA and SFORCE_JSON_CASTING tables to generate the CREATE VIEW DDL. It maps each JSON API field to a VIEW column based on the field label, casting each field according to the SFORCE_JSON_CASTING mapping.
UDF_GEN_VW_HIST is composed of calls to several functions. The key child function is UDF_GEN_TARGET_COLUMNS. This function creates the VIEW’s columns for both standard and custom fields. It uses the FIELD_LABEL as the VIEW’s column name in mixed case, except for the ID field and fields with data types of REFERENCE and LOOKUP; for these, the resulting column name is upper snake_case. The ID field will be named <object_name>_ID. A partial example for VW_CONTACT_HIST:
RECORD_CONTENT:Target_Payload__c.AccountId::STRING AS "ACCOUNT_ID" ,
RECORD_CONTENT:Target_Payload__c.Annual_Income__c::STRING AS "Annual Income" ,
RECORD_CONTENT:Target_Payload__c.Birthdate::DATE AS "Birthdate" ,
RECORD_CONTENT:Target_Payload__c.Id::STRING AS "CONTACT_ID" ,
RECORD_CONTENT:Target_Payload__c.CreatedById::STRING AS "CREATED_BY_ID" ,
RECORD_CONTENT:Target_Payload__c.FirstName::STRING AS "First Name"
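The article does not show UDF_GEN_TARGET_COLUMNS itself. As a rough sketch only, the column list could be assembled from the two meta-data tables with LISTAGG; the meta-data column names used here (FIELD_API_NAME, FIELD_TYPE, SNOWFLAKE_TYPE) are assumptions, and the special handling of ID, REFERENCE, and LOOKUP fields is omitted:
-- Hypothetical sketch: assemble the VIEW column list from the meta-data tables.
-- FIELD_API_NAME, FIELD_TYPE, and SNOWFLAKE_TYPE are assumed column names.
CREATE OR REPLACE FUNCTION UDF_GEN_TARGET_COLUMNS(OBJECT_NAME VARCHAR)
RETURNS VARCHAR
AS
$$
    SELECT LISTAGG('RECORD_CONTENT:Target_Payload__c.' || m.FIELD_API_NAME ||
                   '::' || c.SNOWFLAKE_TYPE || ' AS "' || m.FIELD_LABEL || '"', ' , ')
    FROM SFORCE_META_DATA m
    JOIN SFORCE_JSON_CASTING c
        ON m.FIELD_TYPE = c.FIELD_TYPE
    WHERE m.SFORCE_OBJECT = OBJECT_NAME
$$;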
UDF_GEN_VW_AS_OF_DATE generates the second level VIEW DDL, joining the _HIST VIEW with DIM_CALENDAR_DATE as noted above.
UDF_GEN_VW_CURR generates the third level VIEW’s DDL. It is simply
SELECT <columns> FROM VW_object_AS_OF_DATE WHERE AS_OF_DATE = CURRENT_DATE
All the generated VIEWs use CREATE OR REPLACE syntax. This allows simple updates to the VIEWs based on changes to the SFORCE_META_DATA table. NOTE: due to Snowflake’s compiling of VIEWs, all three VIEWs need to be recreated in order, _HIST first, then _AS_OF_DATE, then _CURR, when the meta-data for the Salesforce Object changes, as both _AS_OF_DATE and _CURR SELECT from the lower-level VIEW in their DDL, which is resolved at CREATE VIEW execution with text substitution.
UDF_DEPLOY_SFORCE_VIEWS(‘object_name’) generates a DDL script for creating the three VIEWs.
The same simple SQL technique as noted above for the physical objects can be used to generate a script to create or refresh multiple objects.
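Mirroring the earlier physical-object script, a script to create or refresh the VIEWs for every Object in the meta-data might be generated as follows:
SELECT
    UDF_DEPLOY_SFORCE_VIEWS(SFORCE_OBJECT)  -- one DDL script per Salesforce Object
FROM
    (SELECT DISTINCT SFORCE_OBJECT FROM SFORCE_META_DATA);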
Query Examples
Let’s examine the results for a Salesforce Account Object selected from the different VIEW types.
SELECT
"Account Name",
ACCOUNT_ID,
last_modified_dtime,
TO_DATE(LAST_MODIFIED_DTIME)
FROM
VW_ACCOUNT_HIST
WHERE
ACCOUNT_ID = '0013R000003pNu8QAE'
ORDER BY
to_Date(last_modified_dtime);
returns:
Note that the account had several updates on two different dates, 10/22 and 10/29.
Let’s see the data for the daily snapshot from VW_ACCOUNT_AS_OF_DATE:
Note that the results remain the same from the last update on 10/22 through 10/28 and change on 10/29.
Let’s look at the current status, VW_ACCOUNT_CURR:
Note that the status has not changed since 10/29.
Finally, let’s see the current values for this Account and its CONTACTs:
SELECT
"Account Name",
"First Name",
"Birthdate",
"Driver's License State"
FROM
VW_ACCOUNT_CURR ACCT
JOIN
VW_CONTACT_CURR CNTC
ON
ACCT.ACCOUNT_ID = CNTC.ACCOUNT_ID
WHERE
ACCT.ACCOUNT_ID = '0013R000003pNu8QAE';
produces:
Summary
The architecture described in this article has numerous advantages:
Epilogue
The architecture and examples in this article were developed and tested months before the first Salesforce Objects were implemented. While the application is still in its initial stages, 22 standard and custom Salesforce Objects have been implemented, producing 161 Snowflake objects. No changes have been required to the architecture.
Copyright © 2021, Jeffrey Jacobs & Associates, LLC