Friend of a friend (aka Relationship Isomorphism …..and how to saturate a CPU in spite of it)

Friend of a friend (aka Relationship Isomorphism …..and how to saturate a CPU in spite of it)

If you are a student of DBMS technologies, then you know that there are four basic join types:

  • Nested Loop (NLJ) – Uses an index to traverse from a parent node to all the children.? Works well when the number of children per parent is low cardinality and the number of parents is small.?E.g. the typical OLTP query.
  • (Sort) Merge (MJ/SMJ)– Uses a sorted list to find large numbers of children for a decent sized number of parents – commonly used in reporting and can be faster (especially when no sorting is required) than nested loop joins.
  • Hash Joins (HJ) – Used when really large numbers of parents/children being searched for – involves creating a Build/Probe list in memory on join keys and using hash values – typically when neither table has an effective index for the query.
  • Cartesian products ….

Wait a second you say…. “Cartesian Products are NOT a join strategy – in fact we try to avoid them!”??Well……that is not quite true.??In fact, 2 common use cases of cartesian products are lateral joins (two one row results) as well as a common fact table processing technique is to do a deliberate Cartesian of all the dimensions in the query and then do a hash join to the fact table.

Essentially, Graph DBMS’s replace the need for NLJ by replacing the index with a graph “edge” (relationship) that allows direct traversal from one entity to another without traversing multiple levels of an index.??A concept called “Index Free Adjacency” in the Graph DBMS world.

This opens up some very interesting query situations as this is commonly exploited in “multi-hop” graph traversals as there are very often needs to quickly traverse down a tree of a graph from the top to the bottom.??This is often supported in GDBMS through the use of “variable length” path operators – such as the *[range] notation in cypher ala:

MATCH circleOfFriends=(p1:Person)-[:IS_FRIENDS_WITH*1..5]->(p2:Person)
WHERE p1.name = “Jeff”
RETURN circleOfFriends        

The above instructs the GDBMS to traverse from the Person node whose name is “Jeff” along the “IS_FRIENDS_WITH” relationship type from 1 to 5 hops out and return all the paths found.??In a RDBMS using SQL (or at least one that doesn’t support CONNECT BY), you would write the above as series of UNION (ALL) operators such as:

// pseudo SQL code (didn't test so there may be errors)
SELECT 
FROM Person p1
??????? JOIN FRIENDS f1 ON p1.personID=f1.srcFriendID
??????? JOIN Person p2 ON f1.tgtFriendID=p2.friendID
WHERE p1.name = “Jeff”
UNION ALL
SELECT …
FROM Person p1
??????? JOIN FRIENDS f1 ON p1.personID=f1.srcFriendID
??????? JOIN Person p2 ON f1.tgtFriendID=p2.friendID
??????? JOIN Friends f2 ON p2.personID=f2.srcFriendID
??????? JOIN Person p3 ON f2.tgtFriendID=p3.personID
WHERE p1.name = “Jeff”
UNION ALL
SELECT …
FROM Person p1
??????? JOIN FRIENDS f1 ON p1.personID=f1.srcFriendID
??????? JOIN Person p2 ON f1.tgtFriendID=p2.friendID
??????? JOIN Friends f2 ON p2.personID=f2.srcFriendID
??????? JOIN Person p3 ON f2.tgtFriendID=p3.personID
??????? JOIN Friends f3 ON p3.personID=f3.srcFriendID
??????? JOIN Person p4 ON f3.tgtFriendID=p4.personID
WHERE p1.name = “Jeff”
……        

Yyyyiiiieeee….it starts to get ugly.??We've all been there. Cut & paste coding and you hope you don’t miss something.?Now, this is where things get a bit challenging.??What if you don’t know how many traversals you need to make???In SQL, you couldn’t do it as you can’t do an infinite number of UNION ALL’s – in fact after some number, you’d likely exceed the stack for even just the query command buffer (often 64KB).?Cypher supports this by not requiring a range, so you could have “simplified” the above by:

MATCH circleOfFriends=(p1:Person)-[:IS_FRIENDS_WITH*]->(p2:Person
WHERE p1.name = “Jeff”
RETURN circleOfFriends)        

Now, this is where life gets frustrating.??Sometimes, we hear from customers that such queries “never return” – aka the “Energizer Bunny” query.??It just keeps going.??

Let’s explore why.

Now, let’s understand a concept call “Trail” vs. “Walk” and how Relationship Isomorphism plays a role here.??Consider the following graph:

No alt text provided for this image

Now, let’s say we want to find all the paths from A->Z.??In Cypher, we’d simply say:

MATCH p=(:A)-[*]->(:Z)
RETURN p        

The result would be a small list:

1.??????A->B->C->D->E->Z

2.??????A->F->G->H->I->Z

3.??????A->J->K->L->M->Z

Yea!??Now let’s complicate the graph a smidge….cause real life is never so simple:

No alt text provided for this image

Some of you quickly spotted the loops at the top. Now if we run the same cypher:

MATCH p=(:A)-[*]->(:Z)
RETURN p        

We should get the results:

1.??????A->B->C->D->E->Z

2.??????A->F->G->H->I->Z

3.??????A->J->K->L->M->Z

4.??????A->B->C->N->O->P->C->D->E->Z

5.??????A->B->C->D->E->Q->R->S->E->Z

6.??????A->B->C->N->O->P->C->D->E->Q->R->S->E->Z

7.??????A->B->G->H->I->Z

8.??????A->B->G->H->M->Z

9.??????A->F->G->H->M->Z

10.??A->J->K->H->I->Z

11.??A->J->K->H->M->Z

Notice we don’t have any infinite looping.??It would be tempting to think that with the C->N->O->P->C as well as the E->Q->R->S->E loops in the above, that a graph traversal would end up spinning forever.?If we did a typical graph “walk”, we would.??Hmmmm….so how would we avoid the problem.??One quick thinking person might say “well the answer is obvious – only traverse each node one time!”??Unfortunately, this wouldn’t work out well.?As a result, paths 4-6 wouldn’t get returned as path #1 would have already traversed node “C” and “E”.?In fact, paths 7-11 also wouldn’t get returned as paths 2 and 3 also would have traversed some of the same nodes.

This is where relationship isomorphism comes in.??It is the concept behind a graph traversal technique called “trail” (vs. “walk”) which rather than exploring every branch from every node regardless (as “walk” does) it instead only traverses a single relationship one time per path generation.?Thus the infinite loops would be blocked as the relationships C->N and E->Q can only be traversed once per path search.??

Sooooo…then why the “Energizer Bunny” queries…..

Well…..think of what happens in the above graphs if I add a relationship from H->E.??How many more paths are created????Answer – at least 7.? Soooo...the more "highly connected" a graph is .....the more paths.

At the extreme scale, let’s take on our “Friend of a friend” graph (with a bit of color coding for ease of discussion):

No alt text provided for this image

If we run the query:

MATCH p=(j:Person {name: "Jeff"})-[:IS_FRIENDS_WITH*..2]->(f:Person)
RETURN length(p) as pathLength, count(p) as numPaths
ORDER BY pathLength
;        

We get the results (expressed as comments here - to make things simpler later):

// Started streaming 2 records after 2 ms and completed after 23 ms.
//╒════════════╤══════════╕
//│"pathLength"│"numPaths"│
//╞════════════╪══════════╡
//│1?????????? │4???????? │
//├────────────┼──────────┤
//│2?????????? │32??????? │
//└────────────┴──────────┘        

Okay – the first one we understand – there are precisely 4 people one hop out.??But 32 for 2 hops????Hmmm….looking at the graph we see that there are:

  • 4 people (in corner) each of one 2 hop path (yellow + blue)
  • 8 people (purple) have at least 2 paths for a total of 16 – for example, I can get from Jeff to Amy via both Dave and Pam with 2 hops
  • The 4 close friends (1 hop away) all can have 2 hop paths ala Jeff -> Dave via Mary or Pam (over orange links).?This is another 8 paths

So that’s 4+16+8 = 28.

Hmmmm….we are missing 4.

Ahhh….but if we think about this a graph perspective, we also have 2 hop results from our close friends such as Jeff -> Dave -> Jeff.??Weird, but a still a valid 2 hop graph output that does not traverse the same relationship twice.???Sooo….that’s the “missing” 4.

Now then, let’s make the query a bit more fun:

MATCH p=(j:Person {name: "Jeff"})-[:IS_FRIENDS_WITH*..10]->(f:Person)
RETURN length(p) as pathLength, count(p) as numPaths
ORDER BY pathLength
;

// Started streaming 10 records after 1 ms and completed after 102656 ms.
//╒════════════╤══════════╕
//│"pathLength"│"numPaths"│
//╞════════════╪══════════╡
//│1?????????? │4???????? │
//├────────────┼──────────┤
//│2?????????? │32??????? │
//├────────────┼──────────┤
//│3?????????? │168?????? │
//├────────────┼──────────┤
//│4?????????? │940?????? │
//├────────────┼──────────┤
//│5?????????? │5024????? │
//├────────────┼──────────┤
//│6?????????? │26780???? │
//├────────────┼──────────┤
//│7?????????? │139996??? │
//├────────────┼──────────┤
//│8?????????? │725556??? │
//├────────────┼──────────┤
//│9?????????? │3709104?? │
//├────────────┼──────────┤
//│10????????? │18761136? │
//└────────────┴──────────┘        

Okayyyyy….we went from 23ms to 1.7 minutes (102 seconds).?But notice that we went to a total of ~23 million paths – increasing ~1 order of magnitude per increase in hop to a peak of ~<19 million paths.??From a standpoint of performance expectations, we are executing 23m/102 = 2.2 million traversals per second on a virtual CPU (1/2 a core) which is in-line with our 4-5 million traversals per core typical performance expectation.??However, it also shows how things quick expand.??Let’s take it a couple of steps further:

MATCH p=(j:Person {name: "Jeff"})-[:IS_FRIENDS_WITH*..11]->(f:Person)
RETURN count(p) as numPaths
;

// Started streaming 1 records after 10 ms and completed after 458178 ms.
// 117,122,572
// 4.5x longer, 5x more paths vs 10 hops; 7m38s


?

MATCH p=(j:Person {name: "Jeff"})-[:IS_FRIENDS_WITH*..12]->(f:Person)
RETURN count(p) as numPaths
;

// Started streaming 1 records after 10 ms and completed after 2468528 ms.
// 580,371,748
// 5x longer, 5x more paths vs 11 hops; 41m08s

?

?
MATCH p=(j:Person {name: "Jeff"})-[:IS_FRIENDS_WITH*]->(f:Person)
RETURN length(p) as pathLength, count(p) as numPaths
ORDER BY pathLength
;

// Killed after >2 hours of running        

We jump from 1.7 to 7.5 minutes and then to 41 minutes (on somewhat a linear progression)…..and then we get “Energizer Bunny”.??If we plot our original 10 hop results on a line, we would get something like:

No alt text provided for this image

Sooo…we can see using a logarithmic scale, it is extremely linear (although diverging slightly).?If we extrapolate, then, if we ran 13 hops, we could expect a 5x increase in runtime to 3.5 hours (ugh) and 14 hops would be ~17 hours…..ouch!??While there likely is a finite number of paths, the number is probably way beyond our patience of waiting for results.??

Soooooo….yepper…..good way to saturate a CPU.??Could we run it in parallel???? Absolutely!?In the past, you could easily do this via the APOC library calls such as apoc.cypher.parallel – but the plan is that future version of Neo4j will support parallel query natively by specifying a parallel runtime engine in the query. ??Today, in addition to APOC, you could code a stored procedure to use parallel threads within Neo4j quite easily.? But…..what degree of parallelism would give us good results???Think about it – even with 10-way parallelism, a 14 hop query would still likely take >1 hour to get results and figuring out how to do the parallel branches without repeating traversals would get to be ….a tad fun.

The net of this discussion is three-fold:

  1. When working with highly-connected graphs, be very careful using unbounded variable length paths – or even large ranges (such as 10 hops) as the number of paths returned could be huge
  2. If you need to deeply traverse such a graph, you may need to use a bit of custom logic to avoid repeatedly traversing subgraphs
  3. Brute force parallelism likely is not the answer

I mention #2 because the reason I got into this mess to start with was a customer set us a job dependency graph from a job scheduling environment – and due to the fact that some jobs had loops where execution ran until some conditions were met – there were multiple instances of “infinite loops” as well as large numbers of interconnections, etc.?The result was I wrote a stored procedure that implemented a breadth first search (vs. a depth first search) tree-traversal that used node isomorphism per path instead and by returning a list of unique nodes/paths the full tree could be visualized.??Some may say …. “but didn’t you say node isomorphism wouldn’t return all the results?”…..well true, but if we look at our earlier example of A->Z, using node isomorphism, we’d get the results:

1.??????A->B->C->D->E->Z

2.??????A->F->G->H->I->Z

3.??????A->J->K->L->M->Z

4.??????A->B->C->N->O->P

5.??????A->B->C->D->E->Q->R->S

6.??????A->B->G

7.??????A->J->K

Several of these paths do not end up at the final desired destination of Z.?However, if displaying them with a visualization tool such as Neo4j Bloom – all the connections would in fact be there – and the results displayed in sub-second even with 40+ hops.??And I did so on a single execution thread. ?I originally thought I would need to use parallel threads……but retrieving >9000 jobs with 40+ hops came back in something like 100ms……and trust me – it took my poor laptop a lot longer than that to try to render the results on the screen….so parallelism would not have achieved anything.

If I would have forced #3 without the smarter logic the way our competition suggests…..wellllllll….with 40 hops, it’d still be running and some data center somewhere where our cloud systems are running would be turning lotsa water into steam keeping things cool….and rivaling bitcoin’s excessive energy requirements to produce 1 result.

Tidbit for all you crypto coin aficionado’s.??Bitcoin mining consumes 91 terawatt-hours of electricity annually……and it doesn’t even have real widespread usage yet.??That’s more than the entire energy consumption of all of Google (as per?https://www.businessinsider.com/bitcoin-mining-electricity-usage-more-than-google-2021-9#:~:text=Bitcoin%20mining%20consumes%20around%2091,from%20just%20five%20years%20ago. )

Obviously, that presents a real challenge for massive public adoption.???But even on a smaller scale, it points out that massively parallel processing also quickly hits adoption limits within a corporation as it becomes financially unsustainable.

Work smarter.? Not harder.

There is no better than you on RDBMS joins, Jeff Tallman . Will read it carefully!

回复

要查看或添加评论,请登录

Jeff Tallman的更多文章

社区洞察