登录查看更多内容

点击“继续加入或登录”，即表示您同意遵守领英的《用户协议》、《隐私政策》及《Cookie 政策》。

Friend of a friend (aka Relationship Isomorphism …..and how to saturate a CPU in spite of it)

Jeff Tallman

Principal Sales Engineer at Neo4j

发布日期: 2022年4月4日

+ 关注

If you are a student of DBMS technologies, then you know that there are four basic join types:

Nested Loop (NLJ) – Uses an index to traverse from a parent node to all the children.? Works well when the number of children per parent is low cardinality and the number of parents is small.?E.g. the typical OLTP query.
(Sort) Merge (MJ/SMJ)– Uses a sorted list to find large numbers of children for a decent sized number of parents – commonly used in reporting and can be faster (especially when no sorting is required) than nested loop joins.
Hash Joins (HJ) – Used when really large numbers of parents/children being searched for – involves creating a Build/Probe list in memory on join keys and using hash values – typically when neither table has an effective index for the query.
Cartesian products ….

Wait a second you say…. “Cartesian Products are NOT a join strategy – in fact we try to avoid them!”??Well……that is not quite true.??In fact, 2 common use cases of cartesian products are lateral joins (two one row results) as well as a common fact table processing technique is to do a deliberate Cartesian of all the dimensions in the query and then do a hash join to the fact table.

Essentially, Graph DBMS’s replace the need for NLJ by replacing the index with a graph “edge” (relationship) that allows direct traversal from one entity to another without traversing multiple levels of an index.??A concept called “Index Free Adjacency” in the Graph DBMS world.

This opens up some very interesting query situations as this is commonly exploited in “multi-hop” graph traversals as there are very often needs to quickly traverse down a tree of a graph from the top to the bottom.??This is often supported in GDBMS through the use of “variable length” path operators – such as the *[range] notation in cypher ala:

MATCH circleOfFriends=(p1:Person)-[:IS_FRIENDS_WITH*1..5]->(p2:Person)
WHERE p1.name = “Jeff”
RETURN circleOfFriends

The above instructs the GDBMS to traverse from the Person node whose name is “Jeff” along the “IS_FRIENDS_WITH” relationship type from 1 to 5 hops out and return all the paths found.??In a RDBMS using SQL (or at least one that doesn’t support CONNECT BY), you would write the above as series of UNION (ALL) operators such as:

// pseudo SQL code (didn't test so there may be errors)
SELECT 
FROM Person p1
??????? JOIN FRIENDS f1 ON p1.personID=f1.srcFriendID
??????? JOIN Person p2 ON f1.tgtFriendID=p2.friendID
WHERE p1.name = “Jeff”
UNION ALL
SELECT …
FROM Person p1
??????? JOIN FRIENDS f1 ON p1.personID=f1.srcFriendID
??????? JOIN Person p2 ON f1.tgtFriendID=p2.friendID
??????? JOIN Friends f2 ON p2.personID=f2.srcFriendID
??????? JOIN Person p3 ON f2.tgtFriendID=p3.personID
WHERE p1.name = “Jeff”
UNION ALL
SELECT …
FROM Person p1
??????? JOIN FRIENDS f1 ON p1.personID=f1.srcFriendID
??????? JOIN Person p2 ON f1.tgtFriendID=p2.friendID
??????? JOIN Friends f2 ON p2.personID=f2.srcFriendID
??????? JOIN Person p3 ON f2.tgtFriendID=p3.personID
??????? JOIN Friends f3 ON p3.personID=f3.srcFriendID
??????? JOIN Person p4 ON f3.tgtFriendID=p4.personID
WHERE p1.name = “Jeff”
……

Yyyyiiiieeee….it starts to get ugly.??We've all been there. Cut & paste coding and you hope you don’t miss something.?Now, this is where things get a bit challenging.??What if you don’t know how many traversals you need to make???In SQL, you couldn’t do it as you can’t do an infinite number of UNION ALL’s – in fact after some number, you’d likely exceed the stack for even just the query command buffer (often 64KB).?Cypher supports this by not requiring a range, so you could have “simplified” the above by:

MATCH circleOfFriends=(p1:Person)-[:IS_FRIENDS_WITH*]->(p2:Person
WHERE p1.name = “Jeff”
RETURN circleOfFriends)

Now, this is where life gets frustrating.??Sometimes, we hear from customers that such queries “never return” – aka the “Energizer Bunny” query.??It just keeps going.??

Let’s explore why.

Now, let’s understand a concept call “Trail” vs. “Walk” and how Relationship Isomorphism plays a role here.??Consider the following graph:

Now, let’s say we want to find all the paths from A->Z.??In Cypher, we’d simply say:

MATCH p=(:A)-[*]->(:Z)
RETURN p

The result would be a small list:

1.??????A->B->C->D->E->Z

2.??????A->F->G->H->I->Z

3.??????A->J->K->L->M->Z

Yea!??Now let’s complicate the graph a smidge….cause real life is never so simple:

Some of you quickly spotted the loops at the top. Now if we run the same cypher:

MATCH p=(:A)-[*]->(:Z)
RETURN p

We should get the results:

1.??????A->B->C->D->E->Z

2.??????A->F->G->H->I->Z

3.??????A->J->K->L->M->Z

4.??????A->B->C->N->O->P->C->D->E->Z

5.??????A->B->C->D->E->Q->R->S->E->Z

6.??????A->B->C->N->O->P->C->D->E->Q->R->S->E->Z

7.??????A->B->G->H->I->Z

8.??????A->B->G->H->M->Z

9.??????A->F->G->H->M->Z

10.??A->J->K->H->I->Z

11.??A->J->K->H->M->Z

Notice we don’t have any infinite looping.??It would be tempting to think that with the C->N->O->P->C as well as the E->Q->R->S->E loops in the above, that a graph traversal would end up spinning forever.?If we did a typical graph “walk”, we would.??Hmmmm….so how would we avoid the problem.??One quick thinking person might say “well the answer is obvious – only traverse each node one time!”??Unfortunately, this wouldn’t work out well.?As a result, paths 4-6 wouldn’t get returned as path #1 would have already traversed node “C” and “E”.?In fact, paths 7-11 also wouldn’t get returned as paths 2 and 3 also would have traversed some of the same nodes.

This is where relationship isomorphism comes in.??It is the concept behind a graph traversal technique called “trail” (vs. “walk”) which rather than exploring every branch from every node regardless (as “walk” does) it instead only traverses a single relationship one time per path generation.?Thus the infinite loops would be blocked as the relationships C->N and E->Q can only be traversed once per path search.??

Sooooo…then why the “Energizer Bunny” queries…..

Well…..think of what happens in the above graphs if I add a relationship from H->E.??How many more paths are created????Answer – at least 7.? Soooo...the more "highly connected" a graph is .....the more paths.

At the extreme scale, let’s take on our “Friend of a friend” graph (with a bit of color coding for ease of discussion):

If we run the query:

MATCH p=(j:Person {name: "Jeff"})-[:IS_FRIENDS_WITH*..2]->(f:Person)
RETURN length(p) as pathLength, count(p) as numPaths
ORDER BY pathLength
;

We get the results (expressed as comments here - to make things simpler later):

// Started streaming 2 records after 2 ms and completed after 23 ms.
//╒════════════╤══════════╕
//│"pathLength"│"numPaths"│
//╞════════════╪══════════╡
//│1?????????? │4???????? │
//├────────────┼──────────┤
//│2?????????? │32??????? │
//└────────────┴──────────┘

Okay – the first one we understand – there are precisely 4 people one hop out.??But 32 for 2 hops????Hmmm….looking at the graph we see that there are:

4 people (in corner) each of one 2 hop path (yellow + blue)
8 people (purple) have at least 2 paths for a total of 16 – for example, I can get from Jeff to Amy via both Dave and Pam with 2 hops
The 4 close friends (1 hop away) all can have 2 hop paths ala Jeff -> Dave via Mary or Pam (over orange links).?This is another 8 paths

So that’s 4+16+8 = 28.

Hmmmm….we are missing 4.

Ahhh….but if we think about this a graph perspective, we also have 2 hop results from our close friends such as Jeff -> Dave -> Jeff.??Weird, but a still a valid 2 hop graph output that does not traverse the same relationship twice.???Sooo….that’s the “missing” 4.

Now then, let’s make the query a bit more fun:

MATCH p=(j:Person {name: "Jeff"})-[:IS_FRIENDS_WITH*..10]->(f:Person)
RETURN length(p) as pathLength, count(p) as numPaths
ORDER BY pathLength
;

// Started streaming 10 records after 1 ms and completed after 102656 ms.
//╒════════════╤══════════╕
//│"pathLength"│"numPaths"│
//╞════════════╪══════════╡
//│1?????????? │4???????? │
//├────────────┼──────────┤
//│2?????????? │32??????? │
//├────────────┼──────────┤
//│3?????????? │168?????? │
//├────────────┼──────────┤
//│4?????????? │940?????? │
//├────────────┼──────────┤
//│5?????????? │5024????? │
//├────────────┼──────────┤
//│6?????????? │26780???? │
//├────────────┼──────────┤
//│7?????????? │139996??? │
//├────────────┼──────────┤
//│8?????????? │725556??? │
//├────────────┼──────────┤
//│9?????????? │3709104?? │
//├────────────┼──────────┤
//│10????????? │18761136? │
//└────────────┴──────────┘

Okayyyyy….we went from 23ms to 1.7 minutes (102 seconds).?But notice that we went to a total of ~23 million paths – increasing ~1 order of magnitude per increase in hop to a peak of ~<19 million paths.??From a standpoint of performance expectations, we are executing 23m/102 = 2.2 million traversals per second on a virtual CPU (1/2 a core) which is in-line with our 4-5 million traversals per core typical performance expectation.??However, it also shows how things quick expand.??Let’s take it a couple of steps further:

MATCH p=(j:Person {name: "Jeff"})-[:IS_FRIENDS_WITH*..11]->(f:Person)
RETURN count(p) as numPaths
;

// Started streaming 1 records after 10 ms and completed after 458178 ms.
// 117,122,572
// 4.5x longer, 5x more paths vs 10 hops; 7m38s


?

MATCH p=(j:Person {name: "Jeff"})-[:IS_FRIENDS_WITH*..12]->(f:Person)
RETURN count(p) as numPaths
;

// Started streaming 1 records after 10 ms and completed after 2468528 ms.
// 580,371,748
// 5x longer, 5x more paths vs 11 hops; 41m08s

?

?
MATCH p=(j:Person {name: "Jeff"})-[:IS_FRIENDS_WITH*]->(f:Person)
RETURN length(p) as pathLength, count(p) as numPaths
ORDER BY pathLength
;

// Killed after >2 hours of running

We jump from 1.7 to 7.5 minutes and then to 41 minutes (on somewhat a linear progression)…..and then we get “Energizer Bunny”.??If we plot our original 10 hop results on a line, we would get something like:

Sooo…we can see using a logarithmic scale, it is extremely linear (although diverging slightly).?If we extrapolate, then, if we ran 13 hops, we could expect a 5x increase in runtime to 3.5 hours (ugh) and 14 hops would be ~17 hours…..ouch!??While there likely is a finite number of paths, the number is probably way beyond our patience of waiting for results.??

Soooooo….yepper…..good way to saturate a CPU.??Could we run it in parallel???? Absolutely!?In the past, you could easily do this via the APOC library calls such as apoc.cypher.parallel – but the plan is that future version of Neo4j will support parallel query natively by specifying a parallel runtime engine in the query. ??Today, in addition to APOC, you could code a stored procedure to use parallel threads within Neo4j quite easily.? But…..what degree of parallelism would give us good results???Think about it – even with 10-way parallelism, a 14 hop query would still likely take >1 hour to get results and figuring out how to do the parallel branches without repeating traversals would get to be ….a tad fun.

The net of this discussion is three-fold:

When working with highly-connected graphs, be very careful using unbounded variable length paths – or even large ranges (such as 10 hops) as the number of paths returned could be huge
If you need to deeply traverse such a graph, you may need to use a bit of custom logic to avoid repeatedly traversing subgraphs
Brute force parallelism likely is not the answer

I mention #2 because the reason I got into this mess to start with was a customer set us a job dependency graph from a job scheduling environment – and due to the fact that some jobs had loops where execution ran until some conditions were met – there were multiple instances of “infinite loops” as well as large numbers of interconnections, etc.?The result was I wrote a stored procedure that implemented a breadth first search (vs. a depth first search) tree-traversal that used node isomorphism per path instead and by returning a list of unique nodes/paths the full tree could be visualized.??Some may say …. “but didn’t you say node isomorphism wouldn’t return all the results?”…..well true, but if we look at our earlier example of A->Z, using node isomorphism, we’d get the results:

1.??????A->B->C->D->E->Z

2.??????A->F->G->H->I->Z

3.??????A->J->K->L->M->Z

4.??????A->B->C->N->O->P

5.??????A->B->C->D->E->Q->R->S

6.??????A->B->G

7.??????A->J->K

Several of these paths do not end up at the final desired destination of Z.?However, if displaying them with a visualization tool such as Neo4j Bloom – all the connections would in fact be there – and the results displayed in sub-second even with 40+ hops.??And I did so on a single execution thread. ?I originally thought I would need to use parallel threads……but retrieving >9000 jobs with 40+ hops came back in something like 100ms……and trust me – it took my poor laptop a lot longer than that to try to render the results on the screen….so parallelism would not have achieved anything.

If I would have forced #3 without the smarter logic the way our competition suggests…..wellllllll….with 40 hops, it’d still be running and some data center somewhere where our cloud systems are running would be turning lotsa water into steam keeping things cool….and rivaling bitcoin’s excessive energy requirements to produce 1 result.

Tidbit for all you crypto coin aficionado’s.??Bitcoin mining consumes 91 terawatt-hours of electricity annually……and it doesn’t even have real widespread usage yet.??That’s more than the entire energy consumption of all of Google (as per?https://www.businessinsider.com/bitcoin-mining-electricity-usage-more-than-google-2021-9#:~:text=Bitcoin%20mining%20consumes%20around%2091,from%20just%20five%20years%20ago. )

Obviously, that presents a real challenge for massive public adoption.???But even on a smaller scale, it points out that massively parallel processing also quickly hits adoption limits within a corporation as it becomes financially unsustainable.

Work smarter.? Not harder.

Prasad Gnana Pandhigunta

Director at BlackRock

2 年

There is no better than you on RDBMS joins, Jeff Tallman . Will read it carefully!

要查看或添加评论，请登录

Jeff Tallman的更多文章

Time Series Vectors in Neo4j

2024年6月19日

Time Series Vectors in Neo4j

What really is time series data? When people of time series data, the context is not always the same. In some cases, it…

2 条评论
Hands on with GenAI & Neo4j Apr 25th in NYC @ AWS

2024年4月4日

Hands on with GenAI & Neo4j Apr 25th in NYC @ AWS

For those of you near to the "Big Apple" and willing to brave the arduous trek into the city..

3 条评论
The Challenges of Virtual Knowledge Graphs

2023年12月21日

The Challenges of Virtual Knowledge Graphs

Nothing is new under the sun. The idea of virtual access to data in remote (and often heterogeneous) data stores is…
Vector Indexing plus Knowledge Graphs with Neo4j

2023年11月21日

Vector Indexing plus Knowledge Graphs with Neo4j

TL;DR: Normally vector indexing is thought of as a common implementation for Generative AI. While this is true at a…

4 条评论
Modeling Longitudinal/Time Series/Sequential Data in Neo4j

2023年7月31日

Modeling Longitudinal/Time Series/Sequential Data in Neo4j

…and Why Analyzing It Is a Graph Problem One of the most common modeling questions we get asked is how to model time…
Modeling Discrete Relationships in Neo4j

2023年7月13日

Modeling Discrete Relationships in Neo4j

Graph Modeling One of the comments on my earlier blog was about graph modeling – it is an area that has tremendous…

1 条评论
Graph Feature Engineering for Longitudinal Data (aka Time Series)

2023年6月24日

Graph Feature Engineering for Longitudinal Data (aka Time Series)

TL;DR: Graph feature engineering in Neo4j for longitudinal (aka time-series) data allows path analysis and other…

2 条评论
DBMS for Data Science: Why Neo4j vs. your tRusty ol’ RDBMS

2023年6月12日

DBMS for Data Science: Why Neo4j vs. your tRusty ol’ RDBMS

A prospective customer asked me a question the other day – why use Neo4j for data science instead of my [existing]…

5 条评论
Buzzword Salad: Ontologies, Digital Twins, Knowledge Graphs….

2022年12月15日

Buzzword Salad: Ontologies, Digital Twins, Knowledge Graphs….

TL;DR: Why RDF databases don’t scale well for Knowledge Graphs Let’s face it, technologists are the worst purveyors of…

4 条评论
Hamiltonian Paths

2022年11月14日

Hamiltonian Paths

Last time I posted – admittedly a while ago…..

See all articles

Jeff Tallman的更多文章

Time Series Vectors in Neo4j

Hands on with GenAI & Neo4j Apr 25th in NYC @ AWS

The Challenges of Virtual Knowledge Graphs

Vector Indexing plus Knowledge Graphs with Neo4j

Modeling Longitudinal/Time Series/Sequential Data in Neo4j

Modeling Discrete Relationships in Neo4j

Graph Feature Engineering for Longitudinal Data (aka Time Series)

DBMS for Data Science: Why Neo4j vs. your tRusty ol’ RDBMS

Buzzword Salad: Ontologies, Digital Twins, Knowledge Graphs….

Hamiltonian Paths

社区洞察