Snowflake and ELVT vs [ELT|ETL] - Case Study Part 2, Real Time Availability for Single Row INSERTs
In my previous article, Snowflake and ELVT vs [ELT|ETL] – Case Study Part 1, The “No Data Model” Data Architecture, I described the challenges and solutions involved in building a data analytics architecture before development of the application had even begun. In addition to the four requirements discussed in that article, there is a fifth requirement:
As a reminder, the data is delivered via the Kafka Connector, which is set to load available data into Snowflake every 10 seconds.
For any given Salesforce object, this effectively becomes a series of single-row INSERTs, each creating a single-row micro-partition, which leads to steadily declining query performance over time.
The challenge is to keep the data available in near real time while still maintaining good performance.
STREAMs to the rescue!
Recall that the architecture is INSERT only and that VIEWs are used for all data access. (See Snowflake and ELVT vs [ELT|ETL] – Case Study Part 1, The “No Data Model” Data Architecture).
The solution is both simple and tunable. (It’s not clear if all Salesforce objects require 10-second availability).
For each Salesforce object, we create both a staging table loaded by the Kafka Connector and an “optimized” table, e.g. ACCOUNT_KAFKA_STG as the staging table and ACCOUNT as the “optimized” table.
CREATE TABLE IF NOT EXISTS
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT
(
    LAST_MODIFIED_DTIME TIMESTAMP_TZ(9),
    CREATED_DTIME TIMESTAMP_TZ(9),
    RECORD_CONTENT VARIANT
)
DATA_RETENTION_TIME_IN_DAYS = 90;
CREATE TABLE IF NOT EXISTS
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
(
    RECORD_METADATA VARIANT,
    RECORD_CONTENT VARIANT
)
DATA_RETENTION_TIME_IN_DAYS = 30;
We create STREAMs on the staging tables, e.g. ACCOUNT_KAFKA_STREAM. The relevant VIEWs perform a UNION ALL between the STREAM and the optimized table (a sketch of such a view follows the STREAM definition below).
CREATE STREAM IF NOT EXISTS
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM
    ON TABLE SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
    APPEND_ONLY = TRUE
    SHOW_INITIAL_ROWS = FALSE;
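For illustration, here is a minimal sketch of such a VIEW. The view name, the SFORCE_LOGICAL_SCHEMA schema, and the choice of exposed columns are assumptions for this example, not the project’s actual definitions. Note that simply SELECTing from a STREAM does not advance its offset; the offset only advances when the STREAM is consumed by DML, i.e. by the TASK shown further below.
CREATE OR REPLACE VIEW SFORCE_LOGICAL_SCHEMA.ACCOUNT_VW AS
-- Rows landed by the Kafka Connector but not yet moved by the TASK
SELECT
    convert_timezone( 'America/Los_Angeles', TO_TIMESTAMP_TZ( record_content:Target_Payload__c.LastModifiedDate::STRING,
        'YYYY-MM-DD"T" HH24:MI:SS.FF TZHTZM' )) AS LAST_MODIFIED_DTIME,
    convert_timezone( 'America/Los_Angeles', TO_TIMESTAMP_TZ( record_content:Target_Payload__c.CreatedDate::STRING,
        'YYYY-MM-DD"T" HH24:MI:SS.FF TZHTZM' )) AS CREATED_DTIME,
    record_content AS RECORD_CONTENT
FROM
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM
UNION ALL
-- Rows already consolidated into the optimized table
SELECT
    LAST_MODIFIED_DTIME,
    CREATED_DTIME,
    RECORD_CONTENT
FROM
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT;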
A TASK is created for each such pair of Snowflake tables; it periodically performs an INSERT from the STREAM into the optimized table:
CREATE OR REPLACE TASK
    SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK
    SCHEDULE = '1440 MINUTE'
    ALLOW_OVERLAPPING_EXECUTION = FALSE
    WAREHOUSE = SFORCE_TASK_USAGE_WH
    USER_TASK_TIMEOUT_MS = 300000
WHEN
    SYSTEM$STREAM_HAS_DATA('SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM')
AS
INSERT INTO
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT
(   SELECT
        convert_timezone( 'America/Los_Angeles', TO_TIMESTAMP_TZ( record_content:Target_Payload__c.LastModifiedDate::STRING,
            'YYYY-MM-DD"T" HH24:MI:SS.FF TZHTZM' )) AS LAST_MODIFIED_DTIME,
        convert_timezone( 'America/Los_Angeles', TO_TIMESTAMP_TZ( record_content:Target_Payload__c.CreatedDate::STRING,
            'YYYY-MM-DD"T" HH24:MI:SS.FF TZHTZM' )) AS CREATED_DTIME,
        record_content AS RECORD_CONTENT -- carry the full payload into the optimized table's RECORD_CONTENT column
    FROM
        SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STREAM
    ORDER BY
        LAST_MODIFIED_DTIME
);
When a TASK is created, it is SUSPENDed; be sure to start/RESUME it:
ALTER TASK SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK RESUME;
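Optionally, the task’s state can be verified:
SHOW TASKS LIKE 'INSERT_INTO_ACCOUNT_TASK' IN SCHEMA SFORCE_PHYSICAL_SCHEMA;
The “state” column should now read “started” rather than “suspended”.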
The default schedule for the TASKs is 24 hours, resulting in a full day’s data in a micro-partition. Note that both the Kafka Connector settings and the TASK scheduling are tunable if necessary in the future.
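For example, if a given object needs the optimized table refreshed more often, the TASK’s schedule can be shortened; the 60-minute value below is purely illustrative, and suspending the task first is the safe way to alter it:
ALTER TASK SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK SUSPEND;
ALTER TASK SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK SET SCHEDULE = '60 MINUTE';
ALTER TASK SFORCE_PHYSICAL_SCHEMA.INSERT_INTO_ACCOUNT_TASK RESUME;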
This provides both real-time availability and more efficient use of Snowflake micro-partitions.
A second TASK periodically removes old rows, and hence old micro-partitions, from the staging tables.
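The exact purge logic was not shown in the project; the following is a minimal sketch, assuming the connector’s default RECORD_METADATA:CreateTime field (the Kafka message timestamp in epoch milliseconds) and an illustrative 7-day window. The window must, of course, be longer than the INSERT task’s schedule so that rows are never deleted before the STREAM has been consumed, and this task must also be RESUMEd.
CREATE OR REPLACE TASK
    SFORCE_PHYSICAL_SCHEMA.PURGE_ACCOUNT_KAFKA_STG_TASK
    SCHEDULE = '1440 MINUTE'
    WAREHOUSE = SFORCE_TASK_USAGE_WH
AS
-- Deleting all rows older than the cutoff lets Snowflake drop the
-- corresponding micro-partitions from the staging table.
DELETE FROM
    SFORCE_PHYSICAL_SCHEMA.ACCOUNT_KAFKA_STG
WHERE
    TO_TIMESTAMP_TZ(RECORD_METADATA:CreateTime::NUMBER, 3) < DATEADD(DAY, -7, CURRENT_TIMESTAMP());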
This pattern provides a consistent, flexible solution for the project’s availability requirement.
Copyright © 2022, Jeffrey Jacobs & Associate, LLC