Modeling Longitudinal/Time Series/Sequential Data in Neo4j
Figure: Example output of time series data from a fictitious patient journey using Neo4j and Bloom

…and Why Analyzing It Is a Graph Problem

One of the most common modeling questions we get asked is how to model time series, sequential, or longitudinal data in Neo4j. When we talk about time series or longitudinal data, we are referring to any data entities that have:

  • A single date or datetime property that represents when the event happened or will happen.
  • A pair of dates/datetimes that represent the start effective and end effective dates.
  • No property, just a relationship that implies a sequence (for example, parent -> child).

And we wish to look at the data from a sequential viewpoint.

Time Series in ER Model

I think the reason we get this question a lot is that while RDBMS systems may have internal triggers to maintain the data, the relational model itself describes sequences poorly, or in a limited way at best. For example, consider the following two sample SQL tables:

/* example 1 */
create table acct_transactions (
    transaction_id          varchar(50)     not null,
    account_id              char(20)        not null,
    amount                  numeric(10,2)   not null,
    transaction_datetime    datetime        not null,
    transaction_account     char(20)        not null,
    transaction_bank        char(9)         not null
);

/* example 2 */
create table insurance_policy (
    policy_id               char(20)        not null,
    policy_type             varchar(30)     not null,
    startEffective          date            not null,
    endEffective            date            not null,
    autorenewal             boolean         not null
);

Consider example 1. The primary key is likely a combination of account_id + transaction_id. Using normal primary key/foreign key relationships, it is difficult to create a relationship that would “tie” the transactions into a sequence. Even if transaction_datetime were part of the key, consider the actual data values: one transaction may have occurred at “2023-07-26 10:23:24.234” and the next at “2023-07-26 10:26:12.567”. In a normal relational model, the primary key and foreign key columns of a relationship must hold the same values, which they obviously do not in this case.

This gets a little more complex with example 2. The primary key is likely policy_id + policy_type + startEffective, although if there can be overlapping instances, it might also include endEffective. The issue again is that the values don’t align: the startEffective for a policy might be “2023-03-01” with an endEffective of “2023-08-31”, and the next renewal begins on “2023-09-01” and ends on “2024-02-29”. Since the values differ, this doesn’t work with standard primary/foreign key relationships. Even if you adjusted the endEffective by one day to match the next startEffective value, the values are in different columns, so it still wouldn’t work. In addition, sometimes (e.g., a gap in insurance coverage) the dates won’t line up at all, but there is still a sequence.

As a result, querying sequences in SQL often involves sorting by date fields, using min()/max() functions on date fields to find the first/last entity in a sequence, or using SQL window functions. It also often results in adding columns such as next_transaction_id to tables so that the sequence has some relational implementation. This works for a few sequences, but when there are many different possible sequences (as we will see below), it doesn't scale well.

With Neo4j, modeling time series is a standard solution for us. It is simpler because our relationships are not restricted to matching node property values. However, there are some considerations, as supporting analytics often requires flexibility that a typical OLTP model may not be concerned with, and very often requires multiple overlays.

The Basics of Modeling Time Series in Neo4j

If you take a graph data modeling class from Neo4j, the common solution presented for time series data is to connect the discrete events with NEXT relationships. This can also include FIRST, LAST, and CURRENT relationship pointers if these facilitate query performance. Consider the highlighted relationships in the following model fragment:

Figure: Example Credit Card Application Graph Data Model with Time Series Data

If we start in the upper right with the relationships highlighted in yellow (CURRENT_STATUS, NEXT_STATUS), this is a classic but minimalist implementation. At any point, one can retrieve every status an application was in (traversing APPLICATION_STATUS) but also very rapidly find the current status. Operations analysts looking to optimize the application flow can also see the sequence of statuses, as well as how long was spent in each one (assuming status nodes have a timestamp), by traversing NEXT_STATUS backwards from the CURRENT_STATUS. This does require a bit of logic in the application, using Cypher such as the following in a single transaction:

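// Single transaction; in fact, this is a single query in Cypher.
// Assumption: the applicationId property and $applicationId parameter are
// illustrative names used to select one application.
MATCH (a:Application {applicationId: $applicationId})-[old:CURRENT_STATUS]->(prev:Status)
// The new status node is fresh, so CREATE is sufficient for its relationships
CREATE (new:Status)
CREATE (a)-[:APPLICATION_STATUS]->(new)
CREATE (prev)-[:NEXT_STATUS]->(new)
CREATE (a)-[:CURRENT_STATUS]->(new)
// Remove the old pointer so there is exactly one CURRENT_STATUS
DELETE old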

Not too difficult.
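
Reading the history back is the mirror image of that update. As a minimal sketch, the following query walks NEXT_STATUS backwards from the current status and estimates the time spent in each status. The applicationId, name, and enteredAt properties are assumptions for illustration and are not taken from the model shown above.

```cypher
// Assumptions: Application.applicationId, Status.name and Status.enteredAt
// (a datetime) are hypothetical property names.
MATCH (a:Application {applicationId: $applicationId})-[:CURRENT_STATUS]->(cur:Status)
// Zero or more NEXT_STATUS hops back from the current status,
// so the current status itself is included in the result
MATCH (s:Status)-[:NEXT_STATUS*0..]->(cur)
// The status that followed s, if any (there is none for the current status)
OPTIONAL MATCH (s)-[:NEXT_STATUS]->(nxt:Status)
RETURN s.name AS status,
       s.enteredAt AS enteredAt,
       // For the current status, measure up to the present moment
       duration.between(s.enteredAt, coalesce(nxt.enteredAt, datetime())) AS timeInStatus
ORDER BY enteredAt DESC
```

Because the traversal starts at the CURRENT_STATUS pointer, the query never has to sort or scan every status in the database to find the latest one.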

Now, if we look at the Application -> MerchantTxn relationships, we see there are four, three of which are highlighted in cyan:

  • APPLICATION_TXN which connects every Application with each MerchantTxn
  • FIRST_TXN which points to the first MerchantTxn
  • LAST_TXN which points to the last MerchantTxn
  • NEXT_TXN which connects the transactions in sequence

We might do this for a variety of reasons. Years ago, one of the patterns for credit card fraud was that the first transaction after activation was at a gas station. These days, the last transaction might have been a $0.99 test charge. The sequence can also be useful for tracing back from a group of fraudulent transactions to a common merchant where the credit card might have been exposed.
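
Keeping these pointers consistent takes a little more logic than the status example. The sketch below appends one transaction and maintains FIRST_TXN, LAST_TXN, and NEXT_TXN in a single query. The property names (applicationId, txnId, txnDatetime) and the parameters are hypothetical, and the sketch assumes transactions arrive in time order.

```cypher
// Assumptions: applicationId, txnId and txnDatetime are hypothetical property
// names; $txnDatetime is an ISO 8601 string; transactions arrive in time order.
MATCH (a:Application {applicationId: $applicationId})
CREATE (a)-[:APPLICATION_TXN]->(t:MerchantTxn {txnId: $txnId,
                                               txnDatetime: datetime($txnDatetime)})
WITH a, t
// Find the previous last transaction, if the application has one
OPTIONAL MATCH (a)-[oldLast:LAST_TXN]->(prev:MerchantTxn)
// No previous transaction: the new one is also the first
FOREACH (ignored IN CASE WHEN prev IS NULL THEN [1] ELSE [] END |
  CREATE (a)-[:FIRST_TXN]->(t))
// Otherwise chain the new transaction onto the end of the sequence
FOREACH (ignored IN CASE WHEN prev IS NOT NULL THEN [1] ELSE [] END |
  CREATE (prev)-[:NEXT_TXN]->(t))
// Move the LAST_TXN pointer; deleting a null relationship is a no-op
CREATE (a)-[:LAST_TXN]->(t)
DELETE oldLast
```

The FOREACH over a one-element or empty list is a common Cypher idiom for conditional writes; a different approach would be needed if transactions could arrive out of order.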

And then we have NEXT_DEVICE_TXN, highlighted in pink. This one could be used to detect fraud by seeing whether consecutive credit card transactions from a given device came from different accounts, or simply to analyze how often Apple Pay is used from a customer’s mobile device. This is where graph databases offer some interesting flexibility in modeling time series, as you can connect the series from different perspectives. In my blog a few weeks ago on graph feature engineering for time series (https://www.dhirubhai.net/pulse/graph-feature-engineering-longitudinal-data-aka-time-series-tallman/), I pointed out that a patient journey from an analytics perspective could have different “NEXT” relationships, e.g., NEXT_VISIT_SAME_DIAGNOSIS.
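
As a rough illustration of the device perspective, the following query looks for consecutive transactions on the same device that belong to different applications. It uses only relationship types named above; the applicationId and txnDatetime properties are assumed names, and treating the Application node as a stand-in for the account is also an assumption.

```cypher
// Assumptions: Application.applicationId and MerchantTxn.txnDatetime are
// hypothetical property names; Application stands in for the account.
MATCH (t1:MerchantTxn)-[:NEXT_DEVICE_TXN]->(t2:MerchantTxn)
MATCH (a1:Application)-[:APPLICATION_TXN]->(t1)
MATCH (a2:Application)-[:APPLICATION_TXN]->(t2)
// Consecutive device transactions that belong to different applications
WHERE a1 <> a2
RETURN a1.applicationId AS firstApplication,
       a2.applicationId AS nextApplication,
       t1.txnDatetime AS firstTxnAt,
       t2.txnDatetime AS nextTxnAt,
       duration.between(t1.txnDatetime, t2.txnDatetime) AS gap
ORDER BY nextTxnAt
```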

We can also differentiate between past and future events. Consider the following fictitious data center build-out plan:

Figure: Time Series with Future Events

The relationships highlighted in yellow (and unhighlighted for FIRST/LAST) provide the time series for current and past changes. Those highlighted in pink provide the time series for future changes and allow planned vs. actual comparisons once changes have been deployed.

The point is that how you link or connect events together really depends on the problems you are trying to solve.

Why It Is a Graph Problem

The Hierarchy of Sequential Data

There is yet another dimension. If we consider any form of data that is sequential, whether time series or simply some sequence of processing irrespective of time, the most likely case is that there is a hierarchy involved that is often not represented in OLTP systems. Let’s consider some common longitudinal/time series data use cases:

  • Customer/Patient Journey
  • Shipping Logistics
  • Job Processing/Product Manufacturing
  • Supply Chain/Manufacturing Design
  • Sales Transactions

There are probably others, but those are enough for our discussion today. Now, during OLTP processing, we really don’t care about any hierarchy. For example, with sales transactions, we simply want to get the customer checked out as rapidly as possible. Similarly, with shipping logistics, when a package moves from location A to location B, we absolutely need to have the package scan events recorded as fast as possible, especially given the ever-increasing volumes of shipments.

But for analytics, there is a hierarchy, and the exact shape of that hierarchy often depends on the perspective of the persona the analysis is for. The result is a hierarchical matrix of longitudinal flows that depends on the abstraction the persona is seeking.

As an example hierarchy, consider the ubiquitous credit card account.

Figure: Example Credit Card Hierarchy of Time Series Data

Transactions may not seem to form a series, but consider a standard deposit/checking account, where the sequence is implied: a deposit has to be applied before withdrawals. Credit cards are similar: the timestamp of each transaction forms the time series, and patterns in that series can be used to detect fraud, e.g., a credit card swiped twice in a row within one minute.

Statements also have an obvious sequence, with start/end dates for each monthly billing period. Each statement contains the subset of transactions that are new for that billing period (including payments, which are a type of transaction that may also have a second label, a topic for another discussion).

It can be a bit more complex - essentially a 3 dimensional hierarchy that for lack of better terminology I think of as a "matrix hierarchy". Below are two examples of this.

BOM Explosion vs. Hierarchy/Matrix and Sequential Data

I don’t claim to be a manufacturing domain expert, but consider the classic BOM (bill of materials) explosion problem. The query is often “I am low on stock of this bolt; which products is it used in?” This is a very valid, time-critical query against the current state. Users of RDBMS often hate BOM explosions, as the most common implementation is a series of UNION ALL statements, with each statement adding further self-joins to reach the next level.
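
For comparison, the same question can be sketched in Cypher as a single variable-length pattern rather than a stack of self-joins. The Part and Product labels, the USED_IN relationship type, and the property names below are all hypothetical and are not taken from any model in this article.

```cypher
// Hypothetical model: (:Part)-[:USED_IN]->(:Part) for sub-assemblies and
// (:Part)-[:USED_IN]->(:Product) for finished goods; partNumber and name
// are assumed property names.
MATCH path = (:Part {partNumber: $partNumber})-[:USED_IN*1..10]->(p:Product)
// A product may be reachable through several assemblies; report it once,
// with the fewest levels between the part and the product
RETURN p.name AS product,
       min(length(path)) AS levelsAbovePart
ORDER BY product
```

The upper bound of 10 hops is an arbitrary guard against runaway traversals; the right limit depends on how deep the real bill of materials goes.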

But BOM explosion is just one of many different business questions for manufacturing analytics. Most of the time, data analysts look at things from a broader perspective, and quite often from the business point of view in hierarchies. Consider the following mock representation of the hierarchies for an automotive manufacturer (or any manufacturer; I like to use cars as many of us are familiar with them):

Figure: Anecdotal Hierarchy of Automotive Assembly

Now that is perhaps the hierarchy of components, but the actual production is on an assembly line with certain tasks ahead of others. In addition, some components or parts may come directly from suppliers already assembled and enter the assembly line at various points. In most cases, then, an assembly line is broken up into multiple stages. Each stage has certain steps as part of manufacturing the next higher level in the hierarchy. Consider the following example of such a hierarchy:

Figure: Sample Product Production Hierarchy of Sequence Data

Consider the following example showing the hierarchy of a mythical assembly line stage for the chassis line with engine assembly as sub-process and the assembly of the cylinder head below that:

Figure: An example assembly sequence and hierarchical levels of automobile manufacturing

In the assembly line there is a sequence of steps at each level. Along the way, the manufacturing process may move from location to location. A few years ago, the automobile manufacturers in North America stated that the typical automobile went back and forth between the US and Canadian factories somewhere between 7 and 14 times. This adds transfer times, etc. to the manufacturing process.

Yes, there are the standard BOM queries, such as “which products use this screw?” But that is likely not the real question. The real question is more likely “Given the shortage of this screw, what are the parts impacted, the assembly lines impacted, etc.?” Then there is often a follow-on question: “What models/trims can be produced in the meantime that have the least impact on the assembly line or prevent idling resources?”

Similarly, when someone is analyzing the assembly line for production rates vs. demand (“demand abstraction”), they probably don’t care about the components or raw materials initially – they likely will be taking a particular model and looking at the aggregate values for production rates at the major assembly stages and seeing if the rate meets demand – and if not, identifying the particular stage in the assembly line where there is a hold-up and then drilling down to the next level(s) to identify a cause.

On the other hand, someone doing defect analysis (“manufacturing defect abstraction”) is likely looking at the flow through the machines used – as well as the personnel involved and suppliers to see if the defects are due to poor training, a particular machine, or a supplier – so to them, the actual parts/components almost are ancillary to their perspective.

Then there are the operations analysts who are looking at cost optimizations of that very expensive automated manufacturing equipment and are more concerned with scheduling: the “resource optimization abstraction”. They may not be concerned about the actual parts/components either, but more with which machines can do which task, the time each task takes, and how they can most efficiently construct machine sequences so that there is a steady flow of manufacturing rather than process hold-ups as parts wait for machine availability.

Adding to the above, there typically are a variety of different brands of machines with different overlapping functionalities in different locations. Competing with the goal of avoiding idle resources are often the “shortest path” type questions: “Which way is the cheapest way to produce X?” “Which way is the fastest method?”

Then there is the resiliency factor: “If a tornado hits plant X, which plants with the correct tooling have the surge capacity to take over production that are reachable by my suppliers or their alternates in a timely fashion?”

The important aspect here is that with time series data, there are often different levels of abstraction and connections between the levels. Further, as we drill down from the top, there may be different perspectives on the definition of the next level, e.g., the machine sequence vs. the component assembly sequence. These perspectives also have relationships between them, such as the machine to the actual component. The result is that many time series problems are actually a pyramid of layers, where each layer gets progressively wider and more detailed, with parallel interconnected paths through the sequence based on the persona’s perspective of the problem.

Figure: Product Manufacturing Matrix Pyramid of Time Series Connections and Levels

Obviously, not all the possible connections are illustrated above, but from just the complexity of what is shown, you can easily see how the different business questions might be much easier to analyze in a graph database than in an RDBMS or a document store such as MongoDB.

Patient Journey Hierarchy

It's not just manufacturing that can have such a hierarchy of time series data. Let’s take the patient journey, for example. Consider the following data model:

Figure: Example Patient 360 Data Model

A patient journey at the lowest, most detailed level of the hierarchy is simply a sequence of “encounters”. It is critical for OLTP processing as well as for some aspects of analytics, including:

  • Health care insurer – patient episode claims processing.
  • Family doctor – treatment plan for current complaint

But as I pointed out in the blog mentioned earlier, a specialist such as an orthopedic doctor probably doesn’t care about encounters unrelated to the specific complaint the patient is seeing them for, e.g., knee pain. Instead, they would prefer to see a “filtered” view of the patient journey based just on that ICD10/diagnosis code. That can still be viewed as the lowest level, just from a “filtered” perspective. As I pointed out though, a pharmaceutical company likely doesn’t care about individual patients, but instead may be looking at a set of “disease journeys” by consolidating/aggregating the patient journey along the lines of “diagnosis”. Going a step further, a clinical researcher might also take the diagnosis aggregation and look at related symptoms as a study of disease progression. So, it is possible to have a matrix pyramid similar to:

Figure: HealthCare Time Series Data and Hierarchical Levels
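
The “filtered” view that a specialist would want can be materialized as its own overlay. As a minimal sketch, the query below links one patient’s encounters for a single diagnosis code with NEXT_VISIT_SAME_DIAGNOSIS relationships. The Patient, Encounter, and Diagnosis labels, the HAS_ENCOUNTER and HAS_DIAGNOSIS relationship types, and all property names are assumptions for illustration; the actual Patient 360 model may differ.

```cypher
// Assumptions: labels, relationship types other than NEXT_VISIT_SAME_DIAGNOSIS,
// and the patientId / icd10Code / encounterDate properties are hypothetical.
MATCH (:Patient {patientId: $patientId})-[:HAS_ENCOUNTER]->(e:Encounter)
      -[:HAS_DIAGNOSIS]->(:Diagnosis {icd10Code: $icd10Code})
// Order the matching encounters in time, then collect them into a list
WITH e ORDER BY e.encounterDate
WITH collect(e) AS visits
// Pair each visit with the one that follows it; range() yields nothing
// when there are fewer than two visits
UNWIND range(0, size(visits) - 2) AS i
WITH visits[i] AS earlier, visits[i + 1] AS later
// MERGE keeps the query safe to re-run; the property records which
// diagnosis this particular overlay belongs to
MERGE (earlier)-[:NEXT_VISIT_SAME_DIAGNOSIS {icd10Code: $icd10Code}]->(later)
```

Once the overlay exists, the specialist’s view is a simple traversal of NEXT_VISIT_SAME_DIAGNOSIS rather than a filter and sort over every encounter.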

At the higher level, one can look at the providers for a specific disease/diagnosis within the insurance carrier’s network, as well as which drugs and regimens are covered by insurance, to best treat the patient.

Net, Net – Modeling Sequences & Time Series in Neo4j

There are a lot of things to consider, but the first set of questions to begin with is:

  • Which nodes form the fundamental sequence?
  • Do we need to rapidly find the current/last/first in the series?
  • Do we need to compare future (planned) and past/present in the series?
  • Do we need to also support “filtered” sequences based on specific characteristics?
  • Is there a hierarchy to the sequence that the business uses to view the sequence from as an aggregation of phases/processes?
  • Do we have associated sequences that are linked – e.g., machines used in sequence & product assembly sequences?

Once you start answering these, the model will start to take its initial shape. Don’t be surprised if, as new questions come in, the model changes slightly by adding new perspectives. By understanding the fundamentals in the above questions, it should be easily adaptable.
