Snowflake and ELVT vs [ELT|ETL] - Case Study Part 2, Real Time Availability for Single Row INSERTs

In my previous article, Snowflake and ELVT vs [ELT|ETL] – Case Study Part 1, The “No Data Model” Data Architecture, I described the challenges and solutions in building a data analytics architecture before development of the application had even begun. That article discussed the four requirements below:

  1. The primary source of data will be a custom Salesforce application, using both standard and custom Salesforce Objects.
  2. Every change to every Salesforce Object record must be captured.
  3. The architecture must support the full data life cycle, from development through production and maintenance.
  4. There will be ongoing changes to the Salesforce Object fields over time.

This article addresses a fifth requirement:

  • Data must be available within 10 seconds of creation/update in Salesforce.

As a reminder, the data is delivered via the Kafka Connector, which is configured to load available data into Snowflake every 10 seconds.

For any given Salesforce object, this effectively becomes a steady trickle of single-row INSERTs, each creating a single-row micro-partition. As these tiny micro-partitions accumulate, query performance steadily declines.

The challenge is making the data available within 10 seconds while still maintaining good query performance.

STREAMs to the rescue!

Recall that the architecture is INSERT only and that VIEWs are used for all data access. (See Snowflake and ELVT vs [ELT|ETL] – Case Study Part 1, The “No Data Model” Data Architecture).

The solution is both simple and tunable. (It’s not clear if all Salesforce objects require 10-second availability).

For each object, we create both a staging table loaded by the Kafka Connector and an “optimized” table, e.g. ACCOUNT_KAFKA_STG (the staging table) and ACCOUNT (the “optimized” table):

-- "Optimized" table holding the consolidated rows; 90 days of Time Travel
CREATE TABLE IF NOT EXISTS
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT
    (
        LAST_MODIFIED_DTIME TIMESTAMP_TZ(9),
        CREATED_DTIME TIMESTAMP_TZ(9),
        RECORD_CONTENT VARIANT
    )
    DATA_RETENTION_TIME_IN_DAYS = 90;

-- Staging table loaded directly by the Kafka Connector; 30 days of Time Travel
CREATE TABLE IF NOT EXISTS
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
    (
        RECORD_METADATA VARIANT,
        RECORD_CONTENT VARIANT
    )
    DATA_RETENTION_TIME_IN_DAYS = 30;

We create STREAMs on the staging tables, e.g. ACCOUNT_KAFKA_STREAM. The relevant VIEWs perform a UNION ALL between the STREAM and the optimized table (a sketch follows the STREAM definition below).

CREATE STREAM IF NOT EXISTS
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM
    ON TABLE SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
    APPEND_ONLY = TRUE
    SHOW_INITIAL_ROWS = FALSE;
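As a rough sketch of such a VIEW (the view name and view schema are illustrative assumptions, and the project’s actual views flatten many more fields from the VARIANT payload):

-- Illustrative sketch only: the view and schema names are assumptions.
CREATE OR REPLACE VIEW SFORCE_VIEW_SCHEMA.ACCOUNT_VW AS
    -- Rows already consolidated into the optimized table
    SELECT
        LAST_MODIFIED_DTIME,
        CREATED_DTIME,
        RECORD_CONTENT
    FROM
        SFORCE_PHYSICAL_SCHEMA.ACCOUNT
    UNION ALL
    -- Rows that arrived since the last TASK run and are still only in the STREAM
    SELECT
        CONVERT_TIMEZONE('America/Los_Angeles', TO_TIMESTAMP_TZ(
            RECORD_CONTENT:Target_Payload__c.LastModifiedDate::STRING,
            'YYYY-MM-DD"T" HH24:MI:SS.FF TZHTZM')) AS LAST_MODIFIED_DTIME,
        CONVERT_TIMEZONE('America/Los_Angeles', TO_TIMESTAMP_TZ(
            RECORD_CONTENT:Target_Payload__c.CreatedDate::STRING,
            'YYYY-MM-DD"T" HH24:MI:SS.FF TZHTZM')) AS CREATED_DTIME,
        RECORD_CONTENT
    FROM
        SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM;

Note that simply SELECTing from a STREAM does not advance its offset; only the TASK’s INSERT below consumes it, so each row appears exactly once in the UNION ALL.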

A TASK is created for each set of Snowflake tables; it periodically performs an INSERT from the STREAM into the optimized table, consuming the STREAM in the process:

CREATE OR REPLACE TASK
    SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK
    SCHEDULE = '1440 MINUTE'
    ALLOW_OVERLAPPING_EXECUTION = FALSE
    WAREHOUSE = SFORCE_TASK_USAGE_WH
    USER_TASK_TIMEOUT_MS = 300000
WHEN
    SYSTEM$STREAM_HAS_DATA('SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM')
AS
    INSERT INTO
        SFORCE_PHYSICAL_SCHEMA.ACCOUNT (LAST_MODIFIED_DTIME, CREATED_DTIME, RECORD_CONTENT)
    SELECT
        -- Extract and convert the Salesforce timestamps from the VARIANT payload
        CONVERT_TIMEZONE('America/Los_Angeles', TO_TIMESTAMP_TZ(
            RECORD_CONTENT:Target_Payload__c.LastModifiedDate::STRING,
            'YYYY-MM-DD"T" HH24:MI:SS.FF TZHTZM')) AS LAST_MODIFIED_DTIME,
        CONVERT_TIMEZONE('America/Los_Angeles', TO_TIMESTAMP_TZ(
            RECORD_CONTENT:Target_Payload__c.CreatedDate::STRING,
            'YYYY-MM-DD"T" HH24:MI:SS.FF TZHTZM')) AS CREATED_DTIME,
        -- Carry the full payload through so the SELECT matches the optimized table's columns
        RECORD_CONTENT
    FROM
        SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM
    -- Cluster the new micro-partitions by modification time
    ORDER BY
        LAST_MODIFIED_DTIME;

When a TASK is created, it is SUSPENDed; be sure to start it with RESUME:

ALTER TASK SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK RESUME;        
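To confirm that the task is actually running, a quick check (a sketch: SHOW TASKS reports the task’s state, and the INFORMATION_SCHEMA TASK_HISTORY table function reports recent runs):

-- Sketch: verify the task's state and its recent executions.
SHOW TASKS LIKE 'INSERT_INTO_ACCOUNT_TASK' IN SCHEMA SFORCE_PHYSICAL_SCHEMA;

SELECT NAME, STATE, SCHEDULED_TIME, COMPLETED_TIME, ERROR_MESSAGE
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'INSERT_INTO_ACCOUNT_TASK'))
ORDER BY SCHEDULED_TIME DESC;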

The default schedule for these TASKs is 24 hours (1440 minutes), consolidating a full day’s data into far fewer micro-partitions. Note that both the Kafka Connector settings and the TASK schedules are tunable if necessary in the future.
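For example, if a particular Salesforce object later needs its optimized table refreshed more often, the TASK’s schedule can be changed (the 60-minute value here is purely illustrative):

-- Suspending first is the safe path when altering a task's properties.
ALTER TASK SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK SUSPEND;
ALTER TASK SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK SET SCHEDULE = '60 MINUTE';
ALTER TASK SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK RESUME;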

This provides both the required real-time availability and more efficient use of Snowflake micro-partitions.

A second TASK periodically removes old rows, and with them their micro-partitions, from the staging tables.
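A minimal sketch of such a cleanup TASK (the task name, schedule, and 30-day retention window are assumptions; the actual criteria depend on the project):

-- Illustrative sketch only: name, schedule, and retention window are assumptions.
CREATE TASK IF NOT EXISTS
    SFORCE_PHYSICAL_SCHEMA.PURGE_ACCOUNT_KAFKA_STG_TASK
    SCHEDULE = '1440 MINUTE'
    WAREHOUSE = SFORCE_TASK_USAGE_WH
AS
    -- Delete staging rows older than 30 days; they have long since been copied to
    -- ACCOUNT, and fully emptied micro-partitions are dropped by Snowflake.
    DELETE FROM SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
    WHERE TO_TIMESTAMP_TZ(RECORD_CONTENT:Target_Payload__c.LastModifiedDate::STRING,
                          'YYYY-MM-DD"T" HH24:MI:SS.FF TZHTZM')
          < DATEADD(DAY, -30, CURRENT_TIMESTAMP());

-- Like any newly created TASK, it must be RESUMEd.
ALTER TASK SFORCE_PHYSICAL_SCHEMA.PURGE_ACCOUNT_KAFKA_STG_TASK RESUME;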

This pattern provides a consistent, flexible solution to the project’s availability requirement.

Copyright © 2022, Jeffrey Jacobs & Associate, LLC
