Time Series Data
A common pattern in Software Development and Data Engineering is storing Time Series Data (a sequence of data points collected or recorded at successive points in time, typically at evenly spaced intervals). I have found the existing patterns lacking.
Why do we need a better approach?
The Solution
Extend the basic principles of Slowly Changing Dimensions Type 2 (Typically only focus on Add and Append) to make it more practical and usable in the real world with much-needed extended features. This does make the solution more complicated, I would not do much of these in EF without first-class support.
Normal Actions
Add: New items added will start with the EffectiveDate as the FromDate and the ToDate will be NULL (Representing the current active value)
Append: When a new value is added it will fetch the previous record and update the ToDate to the new FromDate scoping its validity period. The new record will be added the same way the original add happened with the ToDate as NULL.
Extended Actions
Delete: You should be able to delete existing entries, this will require you to wire up previous/next records to maintain the integrity of the timeline
Insert: You can insert between existing records allowing for more granular values between existing values. Very importantly, you cannot insert with the same FromDate if that happens you first need to delete the existing value.
Eligibility: Here is the main area of focus and my contribution. You can identify gaps without polluting your data or breaking the data integrity.
领英推荐
An Example
In the below example, I have written a script that simulates multiple events, how to persist them in SCD with eligibility, and then how to query them fluidly across a larger period.
7 Events were Processed, 3 Records Persisted and 10 Intervals Represented
The periodic view can be extended to include an Unavailable flag with imputed values from adjacent values while indicating that the real values are unavailable. This will depend on the consumer use case but can give you maximum flexibility and insight.
Also in the case of precise measurements, you might want to drop the precision while retaining the accuracy to keep the volume of records lower e.g. reduce decimal places or use a band not the actual value.
DECLARE @baseDate datetime = '2024-03-20';
DECLARE @validityPeriodInDays int = 1;
DECLARE @TheSourceId int = 101;
/* ======== PHASE 1: Populate the Input Data =========== */
-- A queue representing events that occur
DECLARE @eventQueue TABLE (
EventID int IDENTITY(1,1),
SourceId int,
[Value] varchar(200),
EffectiveDate datetime
);
-- Example events in specific order
INSERT INTO @eventQueue (SourceId, [Value], EffectiveDate)
VALUES (@TheSourceId, 'A', DATEADD(DAY, 1, @baseDate)),
(@TheSourceId, 'A', DATEADD(DAY, 4, @baseDate)),
(@TheSourceId, 'B', DATEADD(DAY, 5, @baseDate)),
(@TheSourceId, 'B', DATEADD(DAY, 6, @baseDate)),
(@TheSourceId, 'B', DATEADD(DAY, 7, @baseDate)),
(@TheSourceId, 'B', DATEADD(DAY, 8, @baseDate)),
(@TheSourceId, 'B', DATEADD(DAY, 9, @baseDate));
-- Show the input events
select * from @eventQueue
-- Table for risk score history (The timeseries persisted data)
DECLARE @eventTimeseries TABLE (
Id int IDENTITY(1,1),
SourceId int,
[Value] varchar(200),
EligibleUntil datetime,
FromDate datetime,
ToDate datetime NULL
);
-- Process the event queue items one at a time (Simulate Real world updates)
DECLARE @rowIndex int = 1;
DECLARE @totalRows int;
SELECT @totalRows = COUNT(*) FROM @eventQueue;
WHILE @rowIndex <= @totalRows
BEGIN
/* ======== PHASE 2: Process the Timeseries =========== */
-- Fetch the current event
DECLARE @Value varchar(200);
DECLARE @EffectiveDate datetime;
DECLARE @SourceId int;
SELECT @Value = [Value], @EffectiveDate = EffectiveDate, @SourceId = SourceId
FROM (
SELECT [Value], EffectiveDate, SourceId, ROW_NUMBER() OVER (ORDER BY EffectiveDate) AS RowNum
FROM @eventQueue
) AS EventData
WHERE RowNum = @rowIndex;
DECLARE @previousId int;
DECLARE @previousFromDate datetime;
DECLARE @previousToDate datetime;
DECLARE @previousEligibleUntil datetime;
DECLARE @currentId int;
DECLARE @currentValue varchar(200);
DECLARE @currentEligibleUntil datetime;
DECLARE @currentFromDate datetime;
DECLARE @currentToDate datetime;
DECLARE @DoExtendEligibility bit = 0;
DECLARE @DoAppendNewRecord bit = 0;
DECLARE @DoClosePreviousRecord bit = 0;
DECLARE @DoCloseCurrentRecord bit = 0;
-- Build the row order
WITH OrderedRows AS (
SELECT
Id, SourceId, [Value], EligibleUntil, FromDate, ToDate,
ROW_NUMBER() OVER (PARTITION BY SourceId ORDER BY FromDate DESC) AS RowNum
FROM @eventTimeseries
WHERE SourceId = @SourceId
AND @EffectiveDate >= FromDate
AND ((ToDate IS NOT NULL AND @EffectiveDate <= ToDate) OR (ToDate IS NULL))
)
-- Select the current and previous rows (RowNum 1 = current, RowNum 2 = previous)
SELECT
-- Current row values (RowNum = 1)
@currentId = CASE WHEN RowNum = 1 THEN Id ELSE @currentId END,
@currentValue = CASE WHEN RowNum = 1 THEN [Value] ELSE @currentValue END,
@currentFromDate = CASE WHEN RowNum = 1 THEN FromDate ELSE @currentFromDate END,
@currentToDate = CASE WHEN RowNum = 1 THEN ToDate ELSE @currentToDate END,
@currentEligibleUntil = CASE WHEN RowNum = 1 THEN EligibleUntil ELSE @currentEligibleUntil END,
-- Previous row values (RowNum = 2)
@previousId = CASE WHEN RowNum = 2 THEN Id ELSE @previousId END,
@previousFromDate = CASE WHEN RowNum = 2 THEN FromDate ELSE @previousFromDate END,
@previousToDate = CASE WHEN RowNum = 2 THEN ToDate ELSE @previousToDate END,
@previousEligibleUntil = CASE WHEN RowNum = 2 THEN EligibleUntil ELSE @previousEligibleUntil END
FROM OrderedRows
WHERE RowNum IN (1, 2);
-- Extend: Current value is same and still eligible
IF (@currentId IS NOT NULL AND @Value = @currentValue AND @EffectiveDate <= @currentEligibleUntil)
BEGIN
SET @DoExtendEligibility = 1;
Print CONCAT(CONVERT(varchar(10), @EffectiveDate) ,' - Extend')
END
-- New: There is no existing records
ELSE IF (@currentId IS NULL AND @previousId IS NULL)
BEGIN
SET @DoAppendNewRecord = 1;
Print CONCAT(CONVERT(varchar(10), @EffectiveDate) ,' - New')
END
-- Append: Previous is tip of timeline
ELSE IF (@currentId IS NULL AND @previousId IS NOT NULL AND @previousToDate IS NULL)
BEGIN
SET @DoClosePreviousRecord = 1;
SET @DoAppendNewRecord = 1;
Print CONCAT(CONVERT(varchar(10), @EffectiveDate) ,' - Append on previous')
END
-- Append: Current is tip of timeline
ELSE IF (@currentId IS NOT NULL AND @currentToDate IS NULL)
BEGIN
SET @DoCloseCurrentRecord = 1;
SET @DoAppendNewRecord = 1;
Print CONCAT(CONVERT(varchar(10), @EffectiveDate) ,' - Append on current - ', @currentValue, ' - ', @Value, ' - ', CONVERT(varchar(10), @currentEligibleUntil))
END
-- Insert: Find and existing record for this period and split it into two
IF (@DoExtendEligibility = 1)
BEGIN
UPDATE @eventTimeseries
SET EligibleUntil = DATEADD(Day, @validityPeriodInDays, @EffectiveDate)
WHERE Id = @currentId;
END
IF (@DoAppendNewRecord = 1)
BEGIN
INSERT INTO @eventTimeseries (SourceId, [Value], EligibleUntil, FromDate, ToDate)
VALUES (@SourceId, @Value, DATEADD(Day, @validityPeriodInDays, @EffectiveDate), @EffectiveDate, NULL)
END
IF (@DoClosePreviousRecord = 1)
BEGIN
IF (@previousEligibleUntil >= @EffectiveDate)
BEGIN
-- Close and link to current from date
UPDATE @eventTimeseries
SET ToDate = @EffectiveDate
WHERE Id = @previousId;
END
ELSE
BEGIN
-- Close and link to max eligible time
UPDATE @eventTimeseries
SET ToDate = @previousEligibleUntil
WHERE Id = @previousId;
END
END
IF (@DoCloseCurrentRecord = 1)
BEGIN
IF (@currentEligibleUntil >= @EffectiveDate)
BEGIN
-- Close and link to current from date
UPDATE @eventTimeseries
SET ToDate = @EffectiveDate
WHERE Id = @currentId;
END
ELSE
BEGIN
-- Close and link to max eligible time
UPDATE @eventTimeseries
SET ToDate = @currentEligibleUntil
WHERE Id = @currentId;
END
END
-- Increment to process the next row
SET @rowIndex = @rowIndex + 1;
END
-- View the raw data stored in time series
SELECT * FROM @eventTimeseries;
/* ======== PHASE 3: Show the historic data over time =========== */
-- View the daily values from the timeseries
DECLARE @StartDate datetime = DATEADD(DAY, 1, @baseDate);
DECLARE @EndDate datetime = DATEADD(DAY, 10, @baseDate);
-- Tally Table to generate a sequence of dates
WITH DateRange AS (
SELECT @StartDate AS [Date]
UNION ALL
SELECT DATEADD(DAY, 1, [Date])
FROM DateRange
WHERE [Date] < @EndDate
)
-- Final table to match dates with the corresponding SCD values
SELECT
DR.[Date],
@TheSourceId as SourceId,
ET.[Value]
FROM
DateRange DR
LEFT JOIN
@eventTimeseries ET
ON DR.[Date] >= ET.FromDate
AND ((ET.ToDate IS NULL AND ET.EligibleUntil > DR.[Date]) OR DR.[Date] < ET.ToDate)
AND SourceId = @TheSourceId
ORDER BY
DR.[Date]
OPTION (MAXRECURSION 1000);
I am sure the above script can be improved and embedded into SDKs, StoredProcedures, Libraries, etc. The intention is not to present the packaged solution but to explain the approach.
Innovator, Beekeeper, Software Dev
4 个月Maybe instead of trying to store all the data accept that the concept of bloat has already been solved in the development of git. Essentially the basics are as follows. Base Object | Original Base Target Object| State Delta| changes Then instead of looking at time based changes you can start looking at business based changes / logical gates to do big delta compressions. Heuristics allow you to access often accessed files and keep them current, Like last X business transactions or certain key actions / functions can be identified. older actions can be sent to ice for low cost storage. As well depending on data changes not every change is important. For example depending on your data you might not want to see 'Old' Audit logs. As they get older than a certain age who cares if a user saw a document or not 5 years ago. By changing your perspective from time to business functions, you transform your data so that you can have intelligent conversations in order to give conceptual actions. This can then add value to discussions on what is needed for business instead of reverting into the technical clerics of what happened and changes the conversation from it is, to this is how people engage with us.