DBMS for Data Science: Why Neo4j vs. your tRusty ol’ RDBMS
A prospective customer asked me a question the other day – why use Neo4j for data science instead of my [existing] RDBMS? The answer is easy – but unfortunately not one you can easily give in a few minutes on a Zoom call, because it requires an understanding of data structures and storage that data scientists typically don't bother with or care about – but that makes all the difference for this question.
Now, to be fair, I don't know which RDBMS they have – but the answer is the same no matter which of the usual suspects – Oracle®, MS SQL®, PostgreSQL®, DB2®, ….
….even if said RDBMS slaps some Bondo® over the rust, paints over top of it with some Rust-Oleum®, and says they support graphs now by wrapping SQL around some GQL implemented on top of a key-value store implemented on top of SQL tables….you know – work-arounds on top of work-arounds. Otherwise known as scotch tape and baling wire. Yep – it grossly simplifies the 40-way unions of BOM explosions…..but just a bit too fragile for high speed.
I know, I know …some of you that know me are thinking that I've forgotten my ~30-year roots in the RDBMS world. Nope – I just believe in using the best tools for the job – my toolbox has drills, saws, hammers and screwdrivers ….not just duct tape & WD-40 – but got those too.
Let’s take a look – using some of the simplest data science requirements.
One of the first classes of data science algorithms introduced to budding data scientists is "Similarities". This class is often the first step in market segmentation and other ways of grouping "similar" customers, health care providers, P&C insurance claims, etc. The net result is that it forms a relationship between every entity and every other entity, with a similarity score that suggests how similar those two entities are. The result of such an operation is typically a table such as:
FromPerson    ToPerson    SimilarityScore
Joe           Sally       0.01
Joe           Fred        0.02
Joe           Tom         0.75
Joe           Mary        0.95
Sally         Mary        0.55
Sally         Fred        0.60
Sally         Joe         0.01
Sally         Tom         0.20
…
Now then, in nearly every RDBMS, the only way this could be implemented is as a classic "Many:Many" relationship with a join table. But wait, we are not done yet. To implement this relationship, we not only need the table, we also need two primary key/foreign key relationships (indices) created. Then we can quickly identify who is the most similar to "Joe" or "Sally" by doing the typical join:
SELECT FromPerson.name, ToPerson.name, similarity.SimilarityScore
FROM Person FromPerson
     JOIN SimilarityScore similarity ON FromPerson.name = similarity.FromPerson
     JOIN Person ToPerson ON similarity.ToPerson = ToPerson.name
WHERE similarity.SimilarityScore > 0.66
Okay…doesn't look so bad, does it? Let's dig a little deeper.
First, we have the schema change to add the join table. Hmmmm …. oh, yeah….might have to wait until the next 3-day holiday weekend to get that one implemented.
Second, we are not just dealing with the five individuals above (a mere 20 ordered pairs) – typically we are working with tens or hundreds of millions, or sometimes billions (or even tens of billions), of distinct entities. That means the join table has on the order of n² possible rows – for 10,000,000 entities, that's roughly 100 trillion (10^14) potential similarity pairs….ouch! Yes, we can (and often do!) put a similarity threshold in place so that similarities < 0.5 (or whatever bottom threshold is desired) are not stored…and this eliminates the bulk of them….but we are still talking multiple orders of magnitude more similarity scores stored in our table than base entities. As a result, those two indices implementing the foreign key back to the Person table are likely to be 8+ levels deep in a typical B+Tree implementation…which means that join we just did? Yep – a gazillion IOs – and at that volume, some of them are going to be physical, because unless you own Micron®, you can't afford that much memory.
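To make that blow-up concrete, here's a quick back-of-the-envelope calculation – a few lines of plain Python, with the entity counts purely as examples – of how many rows an all-pairs similarity table could hold:

# Rough pair-count math for an all-pairs similarity table.
# Ordered (from, to) pairs grow as n * (n - 1), i.e. roughly n squared.

def ordered_pairs(n: int) -> int:
    """Rows in the join table if every entity is compared to every other entity."""
    return n * (n - 1)

for n in (5, 10_000, 1_000_000, 10_000_000):
    print(f"{n:>12,} entities -> {ordered_pairs(n):>26,} possible similarity rows")

# 10,000,000 entities -> ~1e14 possible rows before any cutoff is applied,
# which is why a bottom similarity threshold is a must.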
So….. it's going to be slow. Really slow. Like, start-your-analysis-before-the-weekend slow.
But wait, we are not done yet. The next logical progression in a data science analysis is to take those similarities and try to find "communities". Remember, the "similarity" is just the foundation – the actual market segmentation, etc., will be based on "Communities" of customers, providers, or claims that are similar to each other.
Think about it…. If I am an Auto Insurance company and I have two groups of users:
Community       Insured Auto
Joe, Mary       Porsche Boxster
Fred, Sally     F-150 Limited
Despite the fact both have similar replacement costs, we all know whose collision and medical premiums are likely higher. After all – if that Porsche hits the F-150, we all know whose car is totaled and who is getting the ambulance ride vs. walking away with a dent in the fender.
So, we take that SimilarityScore table (from earlier) and apply our favorite "Community" detection algorithm – Weakly Connected Components (WCC), Louvain, Leiden, ….
…and from this we might get a list of Communities and members:
CommunityID    Members
1              Joe, Tom, Mary
2              Sally, Mary, Fred
…
Okay, now we have a problem. If this were WCC, we'd probably just have a single community containing all five of them…which would be a bit of a lie, since Joe & Sally really are not similar (based on the above scores). But if we did – and we only had a single community for each person – then what do we do with it?
Simple – add it as a column to our Person table.
Ooopppss…see the above comment about schema changes on 3-day weekends. Unless, of course, the RDBMS happens to be a column store – and then you are home free.
If you are using Louvain and looking at subcommunities that are more similar to each other – or if the connections between communities are important to you (e.g. health care provider networking) – now you are implementing yet another many:many join table….. Sigh….been there…done that.
But wait! There's more! We want to find the most influential person in our network – or the person who could connect multiple networks… This is called "Centrality" in data science terms – something we have all heard of before thanks to Google® and algorithms such as PageRank, along with Degree Centrality, Betweenness Centrality, etc. Every person in the network now has a centrality score – which is just one more column to add to our schema.
But, hey! Disks are cheap – it’s only $0.125/GB per month on AWS…..and we have….hmmm…
Soooo….only ~35GB or so….meh – not horrible.
But we have 12 data scientists running their own research. So, multiply everything above by a factor of 10. And they likely want multiple similarities (customers, defect/return ratios, market/basket analysis for items commonly purchased together, etc.).
Uh oh! So, we might have a few TBs, and that would be $128/month per TB, or ~$1,500/year per TB…. Uh….maybe go back on-premise, buy a cheap M.2 NVMe stick and put it in that desktop under your desk instead?
But this is just the tip of the iceberg….
Remember that similarity calculation? So how do we do that in an RDBMS? Answer – we don't. What we do instead is write some nasty query (we'll get there in a minute) that tries to aggregate the common items between two entities and export those results to disk, so we can feed them to some python implementation of Jaccard similarity, compute the results, stream those results to disk, and then re-import them into the database.
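As a sketch of that export-score-reimport round trip (the file names and column layout here are made up for illustration – your export will look different), the python side is typically something like this:

# Minimal sketch of the "export, score in python, re-import" loop.
# Assumes a hypothetical export file purchases.csv with columns: customerID,productSKU
import csv
from collections import defaultdict
from itertools import combinations

# Build each customer's basket of SKUs from the exported query results.
baskets = defaultdict(set)
with open("purchases.csv", newline="") as f:
    for row in csv.DictReader(f):
        baskets[row["customerID"]].add(row["productSKU"])

# Jaccard similarity = |A intersect B| / |A union B| for every pair of customers.
with open("similarities.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["FromPerson", "ToPerson", "SimilarityScore"])
    for a, b in combinations(baskets, 2):        # the pairwise explosion lives right here
        common = len(baskets[a] & baskets[b])
        if common == 0:
            continue
        score = common / len(baskets[a] | baskets[b])
        if score >= 0.5:                          # the bottom threshold discussed earlier
            writer.writerow([a, b, round(score, 4)])

# ...and then similarities.csv gets bulk-loaded back into the join table.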
First let's consider just the data movement part of this. Remember, from above, we have a combinatorial explosion of values – or close to it…..so we are going to need lots of disk space to export to. Now, this is where we start having a problem. On the good side, since we need lots of disk space, most cloud providers provision IOPS by the disk size – e.g. 10 IOPS/GB (we wish!)….but typically about 2500-3000 IOPS per 512GB provisioned. So, this IOPS limit is going to govern the absolute fastest speed at which our query can run.
But it will never hit that limit. Why? Let's take a look at a common query for tool purchases at your favorite big box home improvement store.
SELECT person1.customerID, person2.customerID, COUNT(*) AS toolCount
FROM person person1
     JOIN sales sales1 ON person1.customerID = sales1.customerID
     JOIN salesdetail salesdetail1 ON sales1.transactionID = salesdetail1.transactionID
     JOIN tools tools1 ON salesdetail1.productSKU = tools1.productSKU
     JOIN salesdetail salesdetail2 ON tools1.productSKU = salesdetail2.productSKU
     JOIN sales sales2 ON salesdetail2.transactionID = sales2.transactionID
     JOIN person person2 ON sales2.customerID = person2.customerID
WHERE person1.customerID != person2.customerID
  AND tools1.category = 'Power Tools'
GROUP BY person1.customerID, person2.customerID
HAVING COUNT(*) > 0
Whew! Nothing like a 7-way join to get the CPUs humming! Ignoring all the IO costs of all those index traversals…..a gazillion, to be sure…. let's consider just the aggregation.
In order to do the aggregation, we must maintain an intermediate (hopefully) in-memory table of the grouping keys (person1.customerID & person2.customerID) and the count value that we increment each time we find a successful join. Remember our combinatorial explosion? Yep….it's gonna be HHHUUUGGGEEE. So, to do this efficiently, we have a choice: a sort-based aggregation (sort on the grouping keys and count as we go) or a hash-based aggregation (a hash table keyed on the customer pair).
We already know that this sort is going to spill to disk. In the RDBMS I grew up on, this means tempdb. And we commonly spent forever trying to size tempdb to handle all the concurrent query requirements, only to be thwarted by some large query like this. But even ignoring the disk space, consider the CPU/execution time problem:
If we opt for the faster hash-based implementation, we also need memory for millions of hash buckets, plus, on each hash entry, the pointer to the next linked-list item for that hash bucket. If we assume we are not using a doubly linked list, the result is at least two 64-bit values in addition to our grouping keys and aggregate values.
So, for 100 million pairs we would need several gigabytes just for the hash entries (keys, counts and pointers) – before counting the bucket array itself.
And that's ignoring the overhead of space allocation once it spills to a disk-backed sort (e.g. a typical 2K DBMS page can only store ~1900 bytes of data).
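For the curious, here is that memory estimate worked out – the per-field byte sizes are my assumptions for illustration, not any particular engine's internals:

# Back-of-the-envelope memory for a hash-based GROUP BY on customer pairs.
# Byte sizes are assumptions for illustration, not a specific engine's internals.
pairs         = 100_000_000    # grouping entries held at once
key_bytes     = 2 * 8          # two 64-bit customer IDs
count_bytes   = 8              # the COUNT(*) aggregate
pointer_bytes = 2 * 8          # hash-bucket head + next-entry pointer

per_entry = key_bytes + count_bytes + pointer_bytes              # 40 bytes
print(f"~{pairs * per_entry / 1e9:.0f} GB for the hash entries alone")   # ~4 GB

# Scale that to the ~1 billion intermediate pairs discussed below – plus page
# overhead once it spills to disk – and the working set climbs fast.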
Now, assuming we hit our max of 3000 IOPS (AWS gp3 storage), we would be exporting ~1 billion rows of results, probably using 4K IOs (the typical file system page size) at ~100 rows per 4K page (~32 bytes per row for three 10-digit numbers in csv format), or 10 million IOs @ 3K IOPS = 3333 seconds, or 55 minutes…. which is a fair bit of time – but this is why I said the IOPS limit likely would not be the limiter, as we all know an aggregate query of that size would likely run for multiple hours.
But we are exporting 1 billion rows….and then probably re-importing 100 million rows… which means we likely need to import using a bulk API in batches – maybe at 30K rows per second (yes – burst speed might be 100K rows/sec – but remember, we have an IOPS limit on the cloud storage: 100 million rows / 100 rows per 4K page = 1 million IOs at 3000 IOPS = 333 seconds, or 5.5 minutes – a lot faster than the export!)….so total time on import/export is ~1 hour. Some might point out that I could take the results of the similarity and create a pipeline directly into community detection, and thus have no need to import the similarity results…..but the reality is that you are probably going to run multiple different community detection algorithms with different configurations until you hit upon the combination that works best for that particular use case….so importing them for subsequent experimentation is a given.
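Those export/import timings are simple arithmetic; here's a quick sanity check using the same assumptions as above (4K IOs, ~100 csv rows per page, 3000 provisioned IOPS):

# Sanity-check the export/import timings under the same assumptions as the text:
# 4K file-system IOs, ~100 csv rows per 4K page, 3000 provisioned IOPS.
ROWS_PER_IO = 100
IOPS = 3_000

def io_minutes(rows: int) -> float:
    return rows / ROWS_PER_IO / IOPS / 60

print(f"export 1B rows  : {io_minutes(1_000_000_000):5.1f} minutes")   # ~55.6
print(f"import 100M rows: {io_minutes(100_000_000):5.1f} minutes")     # ~5.6
# Round trip is roughly an hour of pure IO – before the aggregation query itself.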
….and that's PER query. And we have 12 data scientists. And they are not going to get it right the first time – they are going to execute this sort of query multiple times…..and likely using more than just one dimension (tools), but multiple factors (lumber purchases, garden center purchases, etc.). Sooo…unless we have TBs of memory, the sorts will spill to disk faster due to concurrency, which means running slower – which in turn increases the concurrency….and the snowball rolls downhill.
At this point, some wise DBA will point out that the RDBMS likely would have implemented a star schema with a central FACT table of transactions, and that as a result it wouldn't be a 7-way join but only a 3-way join between the Customer dimension and the transaction FACT table. And this would likely be done on a columnar store (such as Snowflake®), where adding the columns for storing community IDs and centrality scores is a non-issue.
Yep – they are right.
But add in those multiple dimensions such as lumber, garden center and other purchases….and you are back somewhere in the neighborhood of 7+ way joins.
But even ignoring that, the aggregation memory requirements and data export space are still the same. And there is the storage for the columnar store holding the star schema in addition to the operational store….but we will ignore that, as in my mind it probably would be necessary even in a graph world, since KPI calculations are simply better done in a column store.
Now, let’s take a look at the same thing in Neo4j.
First, the similarity. To perform the similarity, we don't need to do the massive 3-way or 7-way join with 200+GB of sort space for the aggregation. To be fair, such an aggregation would absolutely die in Neo4j as well – or (by default) exhaust all the memory in Neo4j and cause it to crash. The reason why is that unlike the older RDBMSs, which planned for resource conservation to support tens of thousands of users and thus spilled expensive sorts to disk, Neo4j attempts to do all the aggregation in heap memory. I recently got to experience this with a customer who was trying to compute two aggregate values between doctors that attended the same conference or authored the same papers. A classic similarity – but even with only 1.8 million doctors, such an aggregation resulted in ~300 million pairs with 2 aggregate values – far exceeding the memory of the machine being used.
Instead, we simply invoke the GDS library with the Node Similarity algorithm (which defaults to Jaccard vs. Overlap). Now, it sounds like this might still need the same amount of memory as the 200+GB of sort space the RDBMS join did…. But we don't. GDS implements a memory compression technique called "Compressed Sparse Row" that is essentially compression on top of adjacency lists – itself a very memory-efficient implementation. Add in the other memory-saving techniques such as BitIdMaps and SharedIdMaps, and the net result is that the above similarity computation probably uses <20GB of memory (based on my experience), and most of that will be in the similarity score space consumption.
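For comparison, here is roughly what that looks like with the GDS Python client – a sketch only: the graph name, labels, relationship type and cutoff are illustrative, and parameter names can shift a bit between GDS versions:

# Sketch using the Neo4j GDS Python client (pip install graphdatascience).
# Labels, relationship types and the cutoff are illustrative, not a real schema.
from graphdatascience import GraphDataScience

gds = GraphDataScience("neo4j://localhost:7687", auth=("neo4j", "password"))

# Project the bi-partite Customer -[:PURCHASED]-> Tool graph into memory (CSR under the hood).
G, _ = gds.graph.project(
    "purchases",
    ["Customer", "Tool"],
    "PURCHASED",
)

# Jaccard node similarity, keeping only scores above the cutoff, written straight back
# to the database as SIMILAR_TO relationships carrying a score property.
result = gds.nodeSimilarity.write(
    G,
    similarityCutoff=0.5,
    writeRelationshipType="SIMILAR_TO",
    writeProperty="score",
)
print(result["relationshipsWritten"], "similarity relationships created")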
The second savings is in speed – because this easily fits in memory, we are not impacted by the physical IOs that happen when the sort for the aggregation spills to disk.
Third, once the similarity is done, to store the results we don't need a join table – we can simply create a relationship between the customer nodes. The relationship can have a property for the similarity score. Disk space usage? Probably the same as the join table, but without the index space. Why? Because Neo4j implements a true graph store in which nodes and relationships are stored natively. This is a huge improvement over graph implementations based on key-value stores implemented on SQL tables (such as Oracle's, StarDog's, and…..a plethora of others), as those tables use indices to find the relationships. While we do save a smidge of disk space for storing the results, the real savings is in the query speed for things like the community detection and subsequent processing.
Did you notice that we never exported to a file? And never had to worry about the import?
Yeah – you may point out that your python implementation of similarity runs on a massively parallel GPU monster using 8 Tesla V100s and finishes in <1 hour vs. the probable 2-3 hour run on GDS (using 4-way parallelism to support concurrency). True. But then….with 12 data scientists, you likely have to wait in line for 3 hours (okay, only 2, as you need to export for 1 hour…after the 10 hours the aggregation query executes) to get your job to run, at $16/hr in additional costs, or ~$3000/month….. and suddenly the TCO for Neo4j is like a lighthouse on a dark and stormy night at sea.
The second consideration is that we can mutate the in-memory graph projection from the similarity to perform the community detection and centrality without re-reading the results from disk (yeah – you could do this in your python libraries) – which makes the overall process really, really, really, really simple. So simple a non-data scientist like me can do it. Just don't ask me what the results mean.
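Continuing that hypothetical sketch, chaining community detection and centrality against the same kind of in-memory projection – mutating it rather than round-tripping through disk – looks roughly like this (the property names are mine, and the write-back call has changed names across GDS releases):

# Continuing the sketch: mutate an in-memory projection, no disk round trips.
# Property and relationship names here are illustrative.

# Project the freshly written SIMILAR_TO graph (mono-partite: customer to customer).
G_sim, _ = gds.graph.project(
    "similar_customers",
    "Customer",
    {"SIMILAR_TO": {"orientation": "UNDIRECTED", "properties": "score"}},
)

# Community detection and centrality, each adding a property to the projection in memory.
gds.wcc.mutate(G_sim, mutateProperty="componentId")
gds.louvain.mutate(G_sim, relationshipWeightProperty="score", mutateProperty="communityId")
gds.pageRank.mutate(G_sim, relationshipWeightProperty="score", mutateProperty="centrality")

# Persist only the properties we ended up caring about back onto the database nodes.
# (Older GDS Python clients expose this as gds.graph.writeNodeProperties.)
gds.graph.nodeProperties.write(G_sim, ["communityId", "centrality"])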
Some might say "what about the communityIDs and centrality scores – don't they take up time/space?" Yes – but in this case, Neo4j is similar to a column store – adding a property to a node doesn't require the massive change that adding a column to a typical row-based RDBMS table does. But better than that, we can very easily create new nodes that represent communities and, on those nodes, store statistics about the community size, the relative importance of the community, etc., which helps us focus on the frequent shoppers and contractors vs. the condo dwellers that visit once per year. The beauty of a schema-free DBMS – no begging permission from the DBAs first.
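A sketch of that kind of on-the-fly feature engineering – rolling each detected community up into its own node with a size statistic – is a single Cypher statement run from the same client (the Community label and property names are invented for illustration):

# Sketch of "graph feature engineering": one Community node per detected community,
# with membership relationships and a size statistic. Names are illustrative.
gds.run_cypher("""
    MATCH (c:Customer)
    WHERE c.communityId IS NOT NULL
    WITH c.communityId AS cid, collect(c) AS members
    MERGE (com:Community {communityId: cid})
    SET com.size = size(members)
    WITH com, members
    UNWIND members AS m
    MERGE (m)-[:BELONGS_TO]->(com)
""")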
We call this ability to implement schema changes based on the results of graph data science "graph feature engineering". But in my mind, it begins way before that.
One of the things we skipped over on the way to data science is that many of the algorithms are based on bi-partite and mono-partite implementations. For example, a similarity algorithm runs only on bi-partite graphs – a single primary entity type (customer) and one or more secondary/dimension entities (power tools, lumber, garden center). Community detection runs on mono-partite graphs – e.g. customer to customer. Think Facebook® or Twitter® – that "follows/friends with" relationship – a single entity: the subscriber. In an RDBMS, for the similarity, the only way you can do this is to write those nasty 7+ way joins and export the pairs to disk in a file that infers the relationship between them. Then import the rows into the join table to implement the relationship for future analysis. Then possibly do another table for the communities….
Which brings up an often overlooked aspect: remembering that the relationship exists via that table. In Neo4j, we physically create the relationship as a schema object, and as a result any simple expansion on the node shows the relationship. Net-net: building the bi-partite and mono-partite relationships necessary for data science algorithms is a natural fit in Neo4j, instead of the constant conceptual-to-physical translation that you must do when using an RDBMS.
I call that “Data Science friendly engineering”.