Time Series Data

Time Series Data

A common pattern in Software Development and Data Engineering is storing Time Series Data (a sequence of data points collected or recorded at successive points in time, typically at evenly spaced intervals). I have found the existing patterns lacking.

Why do we need a better approach?

  • The basic software development approaches tend to build the frequency into the structure making querying flued intervals (Hourly, Daily, Monthly, etc.) and point-in-time values difficult
  • There tends to be a ton of duplication, often the DB gets bloated with duplicate values that do not add value e.g. Event Sourcing persisting imutable events regardless of the change, processing these into a timeline can also be very resource intensive
  • Data Engineering has an elegant solution for this (SCD Type 2) that assumes no interruptions or corrections which is not practical in most cases


The Solution

Extend the basic principles of Slowly Changing Dimensions Type 2 (Typically only focus on Add and Append) to make it more practical and usable in the real world with much-needed extended features. This does make the solution more complicated, I would not do much of these in EF without first-class support.

Normal Actions

Add: New items added will start with the EffectiveDate as the FromDate and the ToDate will be NULL (Representing the current active value)

Append: When a new value is added it will fetch the previous record and update the ToDate to the new FromDate scoping its validity period. The new record will be added the same way the original add happened with the ToDate as NULL.

Extended Actions

Delete: You should be able to delete existing entries, this will require you to wire up previous/next records to maintain the integrity of the timeline

Insert: You can insert between existing records allowing for more granular values between existing values. Very importantly, you cannot insert with the same FromDate if that happens you first need to delete the existing value.

Eligibility: Here is the main area of focus and my contribution. You can identify gaps without polluting your data or breaking the data integrity.

Conceptual depiction of how gaps in measurements can be identified


An Example

In the below example, I have written a script that simulates multiple events, how to persist them in SCD with eligibility, and then how to query them fluidly across a larger period.

7 Events were Processed, 3 Records Persisted and 10 Intervals Represented

The periodic view can be extended to include an Unavailable flag with imputed values from adjacent values while indicating that the real values are unavailable. This will depend on the consumer use case but can give you maximum flexibility and insight.

Also in the case of precise measurements, you might want to drop the precision while retaining the accuracy to keep the volume of records lower e.g. reduce decimal places or use a band not the actual value.

DECLARE @baseDate datetime = '2024-03-20';
DECLARE @validityPeriodInDays int = 1;
DECLARE @TheSourceId int = 101;

/* ======== PHASE 1: Populate the Input Data =========== */

-- A queue representing events that occur
DECLARE @eventQueue TABLE (
    EventID int IDENTITY(1,1),
	SourceId int,
    [Value] varchar(200),
    EffectiveDate datetime
);

-- Example events in specific order
INSERT INTO @eventQueue (SourceId, [Value], EffectiveDate) 
VALUES (@TheSourceId, 'A', DATEADD(DAY, 1, @baseDate)),
       (@TheSourceId, 'A', DATEADD(DAY, 4, @baseDate)),
       (@TheSourceId, 'B', DATEADD(DAY, 5, @baseDate)),
	   (@TheSourceId, 'B', DATEADD(DAY, 6, @baseDate)),
       (@TheSourceId, 'B', DATEADD(DAY, 7, @baseDate)),
	   (@TheSourceId, 'B', DATEADD(DAY, 8, @baseDate)),
	   (@TheSourceId, 'B', DATEADD(DAY, 9, @baseDate));

-- Show the input events
select * from @eventQueue

-- Table for risk score history (The timeseries persisted data)
DECLARE @eventTimeseries TABLE (
    Id int IDENTITY(1,1),
	SourceId int,
    [Value] varchar(200),
	EligibleUntil datetime,
    FromDate datetime,
    ToDate datetime NULL
);

-- Process the event queue items one at a time (Simulate Real world updates)
DECLARE @rowIndex int = 1;
DECLARE @totalRows int;

SELECT @totalRows = COUNT(*) FROM @eventQueue;

WHILE @rowIndex <= @totalRows
BEGIN

	/* ======== PHASE 2: Process the Timeseries =========== */

	-- Fetch the current event
	DECLARE @Value varchar(200);
	DECLARE @EffectiveDate datetime;
	DECLARE @SourceId int;

	SELECT @Value = [Value], @EffectiveDate = EffectiveDate, @SourceId = SourceId
	FROM (
		SELECT [Value], EffectiveDate, SourceId, ROW_NUMBER() OVER (ORDER BY EffectiveDate) AS RowNum
		FROM @eventQueue
	) AS EventData
	WHERE RowNum = @rowIndex;

	DECLARE @previousId int;
	DECLARE @previousFromDate datetime;
	DECLARE @previousToDate datetime;
	DECLARE @previousEligibleUntil datetime;
	DECLARE @currentId int;
	DECLARE @currentValue varchar(200);
	DECLARE @currentEligibleUntil datetime;
	DECLARE @currentFromDate datetime;
	DECLARE @currentToDate datetime;

	DECLARE @DoExtendEligibility bit = 0;
	DECLARE @DoAppendNewRecord bit = 0;
	DECLARE @DoClosePreviousRecord bit = 0;
	DECLARE @DoCloseCurrentRecord bit = 0;

	-- Build the row order
	WITH OrderedRows AS (
		SELECT
			Id,	SourceId, [Value], EligibleUntil, FromDate,	ToDate,
			ROW_NUMBER() OVER (PARTITION BY SourceId ORDER BY FromDate DESC) AS RowNum
		FROM @eventTimeseries
		WHERE SourceId = @SourceId 
		  AND @EffectiveDate >= FromDate 
		  AND ((ToDate IS NOT NULL AND @EffectiveDate <= ToDate) OR (ToDate IS NULL))
	)

	-- Select the current and previous rows (RowNum 1 = current, RowNum 2 = previous)
	SELECT 
		-- Current row values (RowNum = 1)
		@currentId = CASE WHEN RowNum = 1 THEN Id ELSE @currentId END,
		@currentValue = CASE WHEN RowNum = 1 THEN [Value] ELSE @currentValue END,
		@currentFromDate = CASE WHEN RowNum = 1 THEN FromDate ELSE @currentFromDate END,
		@currentToDate = CASE WHEN RowNum = 1 THEN ToDate ELSE @currentToDate END,
		@currentEligibleUntil = CASE WHEN RowNum = 1 THEN EligibleUntil ELSE @currentEligibleUntil END,
		-- Previous row values (RowNum = 2)
		@previousId = CASE WHEN RowNum = 2 THEN Id ELSE @previousId END,
		@previousFromDate = CASE WHEN RowNum = 2 THEN FromDate ELSE @previousFromDate END,
		@previousToDate = CASE WHEN RowNum = 2 THEN ToDate ELSE @previousToDate END,
		@previousEligibleUntil = CASE WHEN RowNum = 2 THEN EligibleUntil ELSE @previousEligibleUntil END

	FROM OrderedRows
	WHERE RowNum IN (1, 2);
	
	-- Extend: Current value is same and still eligible
	IF (@currentId IS NOT NULL AND @Value = @currentValue AND @EffectiveDate <= @currentEligibleUntil)
	BEGIN
		SET @DoExtendEligibility = 1;		
		Print CONCAT(CONVERT(varchar(10), @EffectiveDate) ,' - Extend')
	END
	-- New: There is no existing records
	ELSE IF (@currentId IS NULL AND @previousId IS NULL) 
	BEGIN			
		SET @DoAppendNewRecord = 1;			
		Print CONCAT(CONVERT(varchar(10), @EffectiveDate) ,' - New')
	END
	-- Append: Previous is tip of timeline
	ELSE IF (@currentId IS NULL AND @previousId IS NOT NULL AND @previousToDate IS NULL)
	BEGIN
		SET @DoClosePreviousRecord = 1;
		SET @DoAppendNewRecord = 1;		
		Print CONCAT(CONVERT(varchar(10), @EffectiveDate) ,' - Append on previous')
	END
	-- Append: Current is tip of timeline
	ELSE IF (@currentId IS NOT NULL AND @currentToDate IS NULL)
	BEGIN
		SET @DoCloseCurrentRecord = 1;
		SET @DoAppendNewRecord = 1;	
		Print CONCAT(CONVERT(varchar(10), @EffectiveDate) ,' - Append on current - ', @currentValue, ' - ', @Value, ' - ', CONVERT(varchar(10), @currentEligibleUntil))
	END
	-- Insert: Find and existing record for this period and split it into two

	IF (@DoExtendEligibility = 1)
	BEGIN	
		UPDATE @eventTimeseries 
		SET EligibleUntil = DATEADD(Day, @validityPeriodInDays, @EffectiveDate)
		WHERE Id = @currentId;
	END

	IF (@DoAppendNewRecord = 1)
	BEGIN
		INSERT INTO @eventTimeseries (SourceId, [Value], EligibleUntil, FromDate, ToDate)
			VALUES (@SourceId, @Value, DATEADD(Day, @validityPeriodInDays, @EffectiveDate), @EffectiveDate, NULL)
	END

	IF (@DoClosePreviousRecord = 1)
	BEGIN			
		IF (@previousEligibleUntil >= @EffectiveDate)
		BEGIN
			-- Close and link to current from date
			UPDATE @eventTimeseries 
			SET ToDate = @EffectiveDate 
			WHERE Id = @previousId;
		END
		ELSE
		BEGIN
			-- Close and link to max eligible time
			UPDATE @eventTimeseries 
			SET ToDate = @previousEligibleUntil 
			WHERE Id = @previousId;
		END			
	END

	IF (@DoCloseCurrentRecord = 1)
	BEGIN
		IF (@currentEligibleUntil >= @EffectiveDate)
		BEGIN
			-- Close and link to current from date
			UPDATE @eventTimeseries 
			SET ToDate = @EffectiveDate 
			WHERE Id = @currentId;
		END
		ELSE
		BEGIN
			-- Close and link to max eligible time
			UPDATE @eventTimeseries 
			SET ToDate = @currentEligibleUntil 
			WHERE Id = @currentId;
		END	
	END

	-- Increment to process the next row
	SET @rowIndex = @rowIndex + 1;
END

-- View the raw data stored in time series
SELECT * FROM @eventTimeseries;

/* ======== PHASE 3: Show the historic data over time =========== */

-- View the daily values from the timeseries
DECLARE @StartDate datetime = DATEADD(DAY, 1, @baseDate); 
DECLARE @EndDate datetime = DATEADD(DAY, 10, @baseDate); 

-- Tally Table to generate a sequence of dates
WITH DateRange AS (
    SELECT @StartDate AS [Date]
    UNION ALL
    SELECT DATEADD(DAY, 1, [Date])
    FROM DateRange
    WHERE [Date] < @EndDate
)
-- Final table to match dates with the corresponding SCD values
SELECT 
    DR.[Date],
    @TheSourceId as SourceId,
    ET.[Value]
FROM 
    DateRange DR
LEFT JOIN 
    @eventTimeseries ET 
    ON DR.[Date] >= ET.FromDate 
    AND ((ET.ToDate IS NULL AND ET.EligibleUntil > DR.[Date]) OR DR.[Date] < ET.ToDate)
	AND SourceId = @TheSourceId
ORDER BY 
    DR.[Date]
OPTION (MAXRECURSION 1000);        

I am sure the above script can be improved and embedded into SDKs, StoredProcedures, Libraries, etc. The intention is not to present the packaged solution but to explain the approach.









Joan Pearl

Innovator, Beekeeper, Software Dev

4 个月

Maybe instead of trying to store all the data accept that the concept of bloat has already been solved in the development of git. Essentially the basics are as follows. Base Object | Original Base Target Object| State Delta| changes Then instead of looking at time based changes you can start looking at business based changes / logical gates to do big delta compressions. Heuristics allow you to access often accessed files and keep them current, Like last X business transactions or certain key actions / functions can be identified. older actions can be sent to ice for low cost storage. As well depending on data changes not every change is important. For example depending on your data you might not want to see 'Old' Audit logs. As they get older than a certain age who cares if a user saw a document or not 5 years ago. By changing your perspective from time to business functions, you transform your data so that you can have intelligent conversations in order to give conceptual actions. This can then add value to discussions on what is needed for business instead of reverting into the technical clerics of what happened and changes the conversation from it is, to this is how people engage with us.

回复

要查看或添加评论,请登录

Marius Vorster的更多文章

  • Misalignment of Salary Expectations Between Employers and Employees

    Misalignment of Salary Expectations Between Employers and Employees

    Salaries are often a contentious point, and it's easy for people to feel underpaid. Sadly, we often have our toughest…

    4 条评论
  • The Agile Strategy Gap

    The Agile Strategy Gap

    Products build on Agile practices are less likely to fail. I have seen multiple projects doomed by incorrect initial…

  • Why use Mental Models?

    Why use Mental Models?

    Mental models help us understand how the world works, they are heuristic cognitive frameworks that allow us to identify…

    5 条评论
  • School of Life

    School of Life

    It is becoming increasingly important to keep growing effectively, not just ad hoc or out of necessity but…

    2 条评论
  • Develop These Skills Early

    Develop These Skills Early

    Some skills are worth investing in early in your life, they propagate well into different areas, and their benefits…

    3 条评论
  • The Two-Sided Organization

    The Two-Sided Organization

    For many development teams and Agilists, it is difficult to think that their principles and approaches will not scale…

    1 条评论
  • What are you thinking about?

    What are you thinking about?

    At the core of this question is the idea that (what) and (how) we think in the present have a significant influence on…

  • That Which You Already Almost Know

    That Which You Already Almost Know

    Knowledge builds upon knowledge, this is well understood from a societal perspective. Looking at approaches like The…

    1 条评论
  • Snuggle Down With Uncertainty

    Snuggle Down With Uncertainty

    We are living in unprecedented times, COVID-19 has shuffled the board, players, and the rules leaving us with little…

    3 条评论
  • Supercharging Teams

    Supercharging Teams

    Teams can grow into powerful self-organizing units that produce greater work together than the sum of their parts. The…

    1 条评论

社区洞察

其他会员也浏览了