Simplifying time variance in a SQL data warehouse
One of the issues many newcomers to the data warehouse world have difficulty with is managing time variance and non-volatility at scale and with efficiency. One of the basic methods is versioning denormalized structures with a start date and end date into what are commonly called "slowly changing dimensions", but this can have adverse effects. Denormalization exposes the warehouse to update anomalies, and joining tables on start and end dates is complex, which puts more load on both infrastructure and staff. Time-series database platforms are available, but they often restrict how time-based relations can be queried.
One method of implementing time variance that resolves many of these issues leverages surrogation over time at the individual relation level. This accelerates queries, as they become very simple, and it can be used with high relational fidelity.
For the sake of simplicity, only uni-temporal workloads will be addressed here, though these methods can be extended to multi-temporal management by adding to the key.
Suppose this table exists in a transactional system and we want to integrate it into the data warehouse:
CREATE TABLE suppliers (
SupplierID INT (10) unsigned NOT NULL AUTO_INCREMENT
,CompanyName VARCHAR(40) NOT NULL DEFAULT ''
,ContactName VARCHAR(30) NOT NULL DEFAULT ''
,ContactTitle VARCHAR(30) NOT NULL DEFAULT ''
,Address VARCHAR(60) NOT NULL DEFAULT ''
,City VARCHAR(15) NOT NULL DEFAULT ''
,Region VARCHAR(15) NOT NULL DEFAULT ''
,PostalCode VARCHAR(10) NOT NULL DEFAULT ''
,Country VARCHAR(15) NOT NULL DEFAULT ''
,Phone VARCHAR(24) NOT NULL DEFAULT ''
,Fax VARCHAR(24) NOT NULL DEFAULT ''
,HomePage VARCHAR(255) NOT NULL DEFAULT ''
,PRIMARY KEY (SupplierID)
,KEY idx_suppliers_product_name(CompanyName)
,KEY idx_suppliers_postalcode(PostalCode)
) ENGINE = InnoDB AUTO_INCREMENT = 30 DEFAULT CHARSET = utf8;
The first step is to export this table from the source system to a filesystem where it can be ingested into the data warehouse. A number of methods are available, but for this document we'll assume the whole table is snapshot to the filesystem rather than depending on a log-based CDC mechanism. Given that we're snapshotting, we'll make things easier by appending a timestamp to each row in the snapshot:
select *, current_timestamp as exporttime from suppliers;
Once exported, we'll import it directly into a staging table with a very similar schema to the source in the data warehouse.
CREATE TABLE stage_suppliers (
SupplierID INT
,CompanyName string
,ContactName string
,ContactTitle string
,Address string
,City string
,Region string
,PostalCode string
,Country string
,Phone string
,Fax string
,HomePage string
,exporttime timestamp_ntz
);
Our destination table is going to look similar, but we're going to add some metadata specific to managing temporality:
CREATE TABLE suppliers (
SupplierID INT
,SupplierDWID BIGINT
,CompanyName string
,ContactName string
,ContactTitle string
,Address string
,City string
,Region string
,PostalCode string
,Country string
,Phone string
,Fax string
,HomePage string
,dw_starttime timestamp_ntz
,dw_endtime timestamp_ntz
,dw_row_hash string --MD5 returns a hex string, so this can't be a BIGINT
);
We've added: a surrogate key (SupplierDWID), a temporal window (dw_starttime and dw_endtime), and a hash of the business columns (dw_row_hash) for change detection.
First load is very straightforward, as there are no previous records to compare to. One issue on first load is that you don't have any history; you only know the current state of each row, and there's no way to know how long it has been in that state. So we have to assume each row has ALWAYS been in its current state. The other thing to remember is that each row is still in that state, so dw_endtime can't be populated with a known value. You've got two options: seed it with a date far in the future, or leave it nullable. Either works.
INSERT INTO suppliers
SELECT SupplierID
,(
SELECT coalesce(max(SupplierDWID), 0)
FROM suppliers
) + row_number() OVER (
ORDER BY SupplierID --order doesn't actually matter, but row_number requires it
) --surrogating without identity, since some platforms don't support identity or auto increment
,CompanyName
,ContactName
,ContactTitle
,Address
,City
,Region
,PostalCode
,Country
,Phone
,Fax
,HomePage
,'1990-01-01' --this company isn't that old, it'll work for first load
,'9999-01-01' --I've chosen to seed the end date
,MD5(CONCAT (
CompanyName
,ContactName
,ContactTitle
,Address
,City
,Region
,PostalCode
,Country
,Phone
,Fax
,HomePage
)) --hashing the row values; substitute the appropriate hash function for your platform
FROM stage_suppliers;
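The first-load logic (hash the business columns, assign surrogates, seed the temporal window) can be sketched outside SQL as a cross-check. This is a minimal Python illustration, not the warehouse implementation; the function names are hypothetical, and the delimiter in the hash is my addition to avoid concatenation collisions like ("ab","c") vs ("a","bc"):

```python
import hashlib

SEED_START = "1990-01-01"  # assumed "beginning of time" for first load
SEED_END = "9999-01-01"    # far-future end date marking current rows

def row_hash(row, columns):
    # Join the business columns with a delimiter so adjacent values
    # can't shift into each other and collide.
    return hashlib.md5("|".join(str(row[c]) for c in columns).encode()).hexdigest()

def first_load(staged_rows, columns):
    """Assign a surrogate DWID per row and seed the temporal window."""
    versioned_rows = []
    for i, row in enumerate(staged_rows, start=1):
        versioned = dict(row)
        versioned["SupplierDWID"] = i  # 0 existing rows + row_number
        versioned["dw_starttime"] = SEED_START
        versioned["dw_endtime"] = SEED_END
        versioned["dw_row_hash"] = row_hash(row, columns)
        versioned_rows.append(versioned)
    return versioned_rows
```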
Once loaded, each supplier/starttime combination will have a new surrogate SupplierDWID. You'll also likely want to add rows to the table to be used in child relations when you want to represent what would otherwise be a null. That way we can differentiate between the types of null (unknown, not applicable, etc.).
From there, continuous ingestion is very similar to the first load with a couple of added steps:
When a new export is available, truncate and reload the staging table.
The first step is to deal with records that have been deleted or updated; we set dw_endtime to the last export time from the source system:
--find all the rows in suppliers where there is no stage_supplier with the same row hash and set the end time
UPDATE suppliers
SET dw_endtime = (
SELECT max(exporttime) AS endtime
FROM stage_suppliers
)
FROM suppliers
LEFT OUTER JOIN stage_suppliers ON stage_suppliers.SupplierID = suppliers.SupplierID
AND MD5(CONCAT (
stage_suppliers.CompanyName
,stage_suppliers.ContactName
,stage_suppliers.ContactTitle
,stage_suppliers.Address
,stage_suppliers.City
,stage_suppliers.Region
,stage_suppliers.PostalCode
,stage_suppliers.Country
,stage_suppliers.Phone
,stage_suppliers.Fax
,stage_suppliers.HomePage
)) = suppliers.dw_row_hash
WHERE suppliers.dw_endtime > (
SELECT max(exporttime) AS endtime
FROM stage_suppliers
)
AND stage_suppliers.SupplierID IS NULL;
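The same close-out step can be sketched procedurally. This hypothetical Python fragment mirrors the UPDATE above: any current warehouse row whose (SupplierID, hash) pair is absent from the staged snapshot gets its end time set to the snapshot's export time:

```python
def close_out(warehouse_rows, staged_rows, export_time, hash_of):
    """End-date current rows that were deleted or changed in the source.
    `hash_of` computes the business-column hash for a staged row."""
    staged_keys = {(r["SupplierID"], hash_of(r)) for r in staged_rows}
    for row in warehouse_rows:
        # ISO-format strings compare chronologically; the 9999 seed
        # date sorts after any real export time, marking "current".
        is_current = row["dw_endtime"] > export_time
        if is_current and (row["SupplierID"], row["dw_row_hash"]) not in staged_keys:
            row["dw_endtime"] = export_time
    return warehouse_rows
```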
Last, insert the new and temporally changed rows:
INSERT INTO suppliers
SELECT SupplierID
,(
SELECT coalesce(max(SupplierDWID), 0)
FROM suppliers
) + row_number() OVER (
ORDER BY SupplierID
) --surrogating without identity, since some platforms don't support identity or auto increment
,CompanyName
,ContactName
,ContactTitle
,Address
,City
,Region
,PostalCode
,Country
,Phone
,Fax
,HomePage
,exporttime
,'9999-01-01' --I've chosen to seed the end date
,MD5(CONCAT (
CompanyName
,ContactName
,ContactTitle
,Address
,City
,Region
,PostalCode
,Country
,Phone
,Fax
,HomePage
))
FROM stage_suppliers
--only insert rows whose hash isn't already in the warehouse for that supplier
WHERE NOT EXISTS (
SELECT 1
FROM suppliers
WHERE suppliers.SupplierID = stage_suppliers.SupplierID
AND suppliers.dw_row_hash = MD5(CONCAT (
stage_suppliers.CompanyName
,stage_suppliers.ContactName
,stage_suppliers.ContactTitle
,stage_suppliers.Address
,stage_suppliers.City
,stage_suppliers.Region
,stage_suppliers.PostalCode
,stage_suppliers.Country
,stage_suppliers.Phone
,stage_suppliers.Fax
,stage_suppliers.HomePage
))
);
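Again, the insert step can be sketched as plain Python to make the logic explicit. This is a hypothetical illustration, not the warehouse code: staged rows are appended only when their hash isn't already present for that SupplierID, and surrogates continue from the current maximum:

```python
def insert_changed(warehouse_rows, staged_rows, hash_of):
    """Append staged rows that are new or changed relative to the warehouse.
    `hash_of` computes the business-column hash for a staged row."""
    existing = {(r["SupplierID"], r["dw_row_hash"]) for r in warehouse_rows}
    next_dwid = max((r["SupplierDWID"] for r in warehouse_rows), default=0) + 1
    for row in staged_rows:
        h = hash_of(row)
        if (row["SupplierID"], h) not in existing:
            new = dict(row)
            new["SupplierDWID"] = next_dwid  # continue the surrogate sequence
            next_dwid += 1
            new["dw_starttime"] = row["exporttime"]
            new["dw_endtime"] = "9999-01-01"  # seeded end date
            new["dw_row_hash"] = h
            warehouse_rows.append(new)
    return warehouse_rows
```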
We now have a fully temporal surrogated relation. We can repeat the exact same process with child relations; the only difference is that we add the parent temporal surrogate to the child's key AND include it in the row hash. Once this is done, cascading changes through the relations is automatic. If a supplier changes, all of its products will change automatically, and temporal integrity is enforced throughout the entire schema.
Product would go from:
create table product (ProductID int, SupplierID int, ....etc.)
To:
create table product (ProductID int, ProductDWID int, SupplierID int, SupplierDWID int, ....etc.)
The one catch with child relations is that you'll want to add the parent DWID to the row hash. We then continue to cascade the surrogates down to every child of a child of a child.
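That cascade falls out of the hashing alone. In this hypothetical Python sketch, folding the parent's surrogate into the child's hash means a new supplier version produces a different product hash, which forces a new product version on the next load:

```python
import hashlib

def child_row_hash(row, columns, parent_dwid):
    """Hash a child row, including the parent's temporal surrogate so a
    new parent version automatically forces a new child version."""
    payload = "|".join([str(parent_dwid)] + [str(row[c]) for c in columns])
    return hashlib.md5(payload.encode()).hexdigest()
```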
The result is a dramatic reduction in complexity at query time. If we want to make it easy for users to see "current", we can just cut a view for each base table like:
create view product_current as
select *
from product
where dw_endtime = '9999-01-01';
And they can treat it like an operational data store.
If you want results over time in a query, just join on the DWIDs rather than using a compound join on start and end dates.
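As a toy illustration of that payoff (SQLite via Python, with hypothetical data), history falls out of a plain equi-join on the surrogate, with no BETWEEN-style date-range predicate needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE suppliers (SupplierID INT, SupplierDWID INT, CompanyName TEXT,
                        dw_starttime TEXT, dw_endtime TEXT);
CREATE TABLE product (ProductID INT, ProductDWID INT, SupplierDWID INT, Name TEXT,
                      dw_starttime TEXT, dw_endtime TEXT);
-- two versions of supplier 1: renamed on 2024-01-01
INSERT INTO suppliers VALUES (1, 10, 'Acme',   '1990-01-01', '2024-01-01');
INSERT INTO suppliers VALUES (1, 11, 'AcmeCo', '2024-01-01', '9999-01-01');
-- product versions pinned to the supplier version they were loaded against
INSERT INTO product VALUES (7, 100, 10, 'Widget', '1990-01-01', '2024-01-01');
INSERT INTO product VALUES (7, 101, 11, 'Widget', '2024-01-01', '9999-01-01');
""")
# The full history is a simple equi-join on the surrogate key:
rows = conn.execute("""
    SELECT p.Name, s.CompanyName, p.dw_starttime
    FROM product p JOIN suppliers s ON s.SupplierDWID = p.SupplierDWID
    ORDER BY p.dw_starttime
""").fetchall()
print(rows)  # each product version paired with the supplier version it saw
```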