Modeling Discrete Relationships in Neo4j
Graph Modeling
One of the comments on my earlier blog was about graph modeling – an area that has a tremendous impact on the performance of any database. The problem for a graph database is that the traditional types of denormalization – the vertical/horizontal schema changes we are all familiar with from RDBMSs – do not translate well, and can actually hinder graph data science.
I am a big fan of not throwing the “baby out with the bath water”, “re-inventing the wheel”, or whatever cliché you prefer for starting all over. Sometimes – yes – it is necessary. However, decades of data modeling techniques shouldn’t be ignored just to question whether a square tire is advisable because it increases the contact surface with the road; it also increases rolling resistance by many orders of magnitude. Since data sources for Neo4j often come from an RDBMS – even if via a data lake such as Databricks – it is tempting to use your existing ER model as a starting place and begin reversing out the RDBMS-specific tuning. While doable, we are so ingrained in the relational approach at this point that this often fails unless you are very careful – and it will likely take longer. Instead, the approach we suggest is the “whiteboard” model. There are several distinct advantages to this:
I like to use the ER model as a “check” – e.g., am I missing anything?
Graph Modeling Best-Practices
Obviously, a complete course on graph modeling would take a thick book – and some are out there – but I have my own list of graph modeling best practices:
I cover these in great detail in the 1-day modeling class that I teach as part of my field engineering tasks. In today’s blog, I will focus on #6 – “Use Discrete Relationships” – because it came up in a customer situation and is a very common technique that makes data science much easier.
Integrative Biomedical Knowledge Hub (iBKH)
I will use a public example from the Integrative Biomedical Knowledge Hub (iBKH) project available at https://github.com/wcm-wanglab/iBKH . A picture of the model is as follows:
Note that in my discussion, I am not implying anything is wrong with the model as they have published it, but showing how, based on requirements, one might “refactor” it to enable specific use cases. That use case: trying to find drugs that have similar relationships with diseases and whether they impact the same genes or contain the same molecules. It would be a common, simple question to see if drug efficacy is due to specific molecules or genes, and then whether a side effect could be avoided in cases where the genetic compositions of patients differ. Scary thought, but medicine has advanced so rapidly into genetic specifics that I can almost foresee the day a trip to the doctor’s has them looking at your “23andMe” DNA test results to see what to prescribe.
Drug -> Disease and Drug -> Gene Relationships
In order to work with the dataset, you have to download the nodes and relationships from https://wcm.box.com/s/gagu6yj2toyk4kirb6hpsb1qu4dm203p and https://wcm.box.com/s/dcq6lj4vxzs4rnxu6xx60ziwl62qrzyp respectively. In the relationships data, there is a single file “D_Di_res.csv” that contains the relationship definition between Drug and Disease. One of the things I tell people to do in my modeling class is to always open the file, look at the data, and try to figure out the modeling implications – not just the keys used to find the source and target nodes. In this case, opening the file would show something like the following:
Note that in their model schema, they only define a single relationship. However, looking at the data, we can see that there are different roles/specific classes of relationships available. In some cases, multiple types of relationships exist simultaneously. Yes, we could load this as is and add properties on the relationship for each of these – either as an integer (as is) or cast to a Boolean. But I think there is a better way. The reason is that in comparing drugs and diseases, I might want a better similarity score for drugs that actually “Prevent” the disease vs. those that just seem to be a “biomarker for disease progression” – or simply “Palliate” (reduce severity or symptoms but not the cause).
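The two modeling choices can be sketched as Cypher patterns (the property and relationship names here follow the flag columns seen in “D_Di_res.csv”; everything else is illustrative):

```cypher
// Flag-property model (as shipped): a single TREATS relationship
// carrying one 0/1 column per relationship class
(drug:Drug)-[:TREATS {Prevents: 1, Palliates: 0, Treats: 1}]->(disease:Disease)

// Discrete-relationship model: one relationship type per class,
// where the same node pair can carry several types simultaneously
(drug:Drug)-[:TREATS]->(disease:Disease)
(drug)-[:PREVENTS]->(disease)
```

The discrete form lets projections and traversals select a class by relationship type alone, with no property filtering.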
Similarly, if we look at the Drug to Gene relationship in the file D_G_res, we can see:
As you can see, there are similar specifics such as “Downregulates”, “Upregulates”, etc.
It might be that this was done following the classic many-to-many join table implementation that would allow an RDBMS to load this data – but for a graph database, we can implement each of these as discrete relationships…so the model schema gets a bit of refinement:
Normal Forms vs. Graph Modeling
The reason for wanting those discrete relationships has to do with the syntax of most Neo4j GDS graph projections and algorithm executions. The optimal form is the “native” projection, which takes the syntax of:
CALL gds.graph.project(
  graphName: String,
  nodeProjection: String or List or Map,
  relationshipProjection: String or List or Map,
  configuration: Map
) YIELD
  graphName: String,
  nodeProjection: Map,
  nodeCount: Integer,
  relationshipProjection: Map,
  relationshipCount: Integer,
  projectMillis: Integer
As you can tell, there is no real place for a “where clause” to filter on relationship properties. The map for a relationshipProjection does allow relationship properties for weights, but not for filtering – you could project a graph including all the properties and then project a subgraph from it, but that is a bit of a contortion. The alternative is to use a Cypher projection, which has the syntax:
CALL gds.graph.project(
    graphName: String,
    sourceNode: Node or Integer,
    targetNode: Node or Integer,
    dataConfig: Map,
    configuration: Map
) YIELD
    graphName: String,
    nodeCount: Integer,
    relationshipCount: Integer,
    projectMillis: Integer,
    configuration: Map
The way this works is that the clauses prior to this procedure call include query syntax to focus on the nodes in question – so it would be possible to do something like:
MATCH (drug:Drug)-[:TREATS {Prevents: 1}]->(disease:Disease)
CALL gds.graph.project('drugdisease', drug, disease, …)
This would work, but it has a few problems: you would need a few UNION ALLs to also include the Drug->Gene combinations, and adding weights gets more complicated – as does the syntax to call out the various combinations of relationship classes we would want vs. just the one. Think about the example that “Treats” and “Palliative” might have very low weights vs. “Prevents” or “Treatment/Therapy”.
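With discrete relationship types, the same selection collapses into a single native projection. Here is a hedged sketch – the graph name, type list, and weight values are illustrative, not from the iBKH project. Since the discrete relationships carry no stored weight property, GDS’s per-type defaultValue can serve as the weight for each class:

```cypher
// Select relationship classes by type; assign each class its own weight
// via defaultValue (used because no 'weight' property exists on the rels)
CALL gds.graph.project(
  'drugSimilarity',
  ['Drug', 'Disease', 'Gene'],
  {
    PREVENTS:      { properties: { weight: { defaultValue: 1.0 } } },
    TREATS:        { properties: { weight: { defaultValue: 0.5 } } },
    PALLIATES:     { properties: { weight: { defaultValue: 0.1 } } },
    UPREGULATES:   { properties: { weight: { defaultValue: 0.8 } } },
    DOWNREGULATES: { properties: { weight: { defaultValue: 0.8 } } }
  }
)
YIELD graphName, nodeCount, relationshipCount;
```

No filtering clause is needed – selecting the relationship types *is* the filter, and each class gets its own weight for downstream algorithms.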
This is not unique to graph modeling – in the RDBMS world, optimizations – most often denormalizations – involve reducing the “normal form” from 3rd NF to 2nd NF or lower. Optimizing a graph model is often the opposite – going from 2nd or 3rd NF to possibly BCNF, 4th NF, or 5th NF. For Labeled Property Graphs it mostly stops there, without digressing into the mess of 6th NF that RDF-based graph DBMSs require. It’s not exactly accurate to describe it that way, though, as the notion of different relationships between the same exact two tuples doesn’t exist in a relational model. But since a picture is worth a million words….
As you can see from the diagram, optimizations for an RDBMS tend to reduce normal form adherence, whereas with an LPG GDBMS, they tend to increase it. RDF-based GDBMSs tend to be in 6th NF simply by definition, with no real way to change that via optimization.
Now, knowing we want to create these discrete relationships, there are two ways we can do this – simply load the data as is and refactor the graph after the fact – or create the relationships during the load itself.
Creating the discrete relationships via refactoring
Let’s assume we already loaded the data – or we got it via a dump file with the data already loaded. To refactor the graph, we would simply need to execute a series of Cypher statements such as:
MATCH (drug:Drug)-[:TREATS {Palliates: 1}]->(disease:Disease)
CALL {
  WITH drug, disease
  MERGE (drug)-[:PALLIATES]->(disease)
} IN TRANSACTIONS OF 10000 ROWS
;

MATCH (drug:Drug)-[:TREATS {InhibitsCellGrowth: 1}]->(disease:Disease)
CALL {
  WITH drug, disease
  MERGE (drug)-[:INHIBITS_CELL_GROWTH]->(disease)
} IN TRANSACTIONS OF 10000 ROWS
;

MATCH (drug:Drug)-[:TREATS {Alleviates: 1}]->(disease:Disease)
CALL {
  WITH drug, disease
  MERGE (drug)-[:ALLEVIATES]->(disease)
} IN TRANSACTIONS OF 10000 ROWS
;

MATCH (drug:Drug)-[:TREATS {DiseaseProgressionBiomarker: 1}]->(disease:Disease)
CALL {
  WITH drug, disease
  MERGE (drug)-[:PROGRESSION_BIOMARKER]->(disease)
} IN TRANSACTIONS OF 10000 ROWS
;
…
Using the CALL {} IN TRANSACTIONS syntax allows us to batch the modifications into groups of transactions vs. running the refactoring as a single large transaction and possibly exceeding the heap memory available in Neo4j. Note that if you look at the data, every record seems to indicate that there is always a TREATS relationship – so we do not need to remove the original relationship when done. It can be thought of as a generalized relationship that can be used whenever you want to see any relationship between a drug and a disease, regardless of type.
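Writing one statement per relationship class gets tedious. A more generic sketch – assuming the APOC library is installed, and that the flag-to-type map below is extended to cover the actual columns in the data (both are assumptions on my part) – uses apoc.merge.relationship to derive the relationship type dynamically:

```cypher
// Map each 0/1 flag property on TREATS to its discrete relationship type.
// Extend this map to cover every flag column present in D_Di_res.csv.
:param typeMap => ({
  Palliates: 'PALLIATES',
  Alleviates: 'ALLEVIATES',
  InhibitsCellGrowth: 'INHIBITS_CELL_GROWTH',
  DiseaseProgressionBiomarker: 'PROGRESSION_BIOMARKER'
});

MATCH (drug:Drug)-[t:TREATS]->(disease:Disease)
CALL {
  WITH drug, t, disease
  // For every flag set to 1, merge the corresponding discrete relationship
  UNWIND [k IN keys($typeMap) WHERE t[k] = 1 | $typeMap[k]] AS relType
  CALL apoc.merge.relationship(drug, relType, {}, {}, disease) YIELD rel
  RETURN count(rel) AS created
} IN TRANSACTIONS OF 10000 ROWS
RETURN sum(created) AS relationshipsCreated;
```

One batched pass over the TREATS relationships then creates all the discrete types at once.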
Creating the discrete relationships during load
At its core, Neo4j has the CREATE clause, which can be used to create either nodes or relationships. In addition, Neo4j also supports the MERGE clause, which has ON CREATE SET and ON MATCH SET subclauses that allow you to create a node/relationship if it doesn’t exist – or update it if it does – with different actions for each case. I prefer using MERGE for both – however, if I absolutely know the database is empty, I might use CREATE. Both of these commands are useful for creating a single node/relationship.
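A minimal sketch of the two subclauses (the drug id and the property names here are illustrative, not from the iBKH files):

```cypher
MERGE (d:Drug {primary: 'some-drug-id'})
  ON CREATE SET d.firstLoaded = timestamp()  // runs only when the node is new
  ON MATCH  SET d.lastSeen    = timestamp()  // runs only when it already existed
RETURN d;
```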
However, we often are not creating just a single node. This is where the LOAD CSV clause comes in. Neo4j’s Cypher dialect is a “streaming” or “pipelined” language vs. a “statement” dialect such as SQL. As a result, really long queries can be created in Neo4j where the results of one clause simply feed into the next. So a common load scenario into Neo4j in this case might look like:
// Assume Drug and Disease are already loaded
LOAD CSV WITH HEADERS FROM '<url or filepathname>' AS inputRow
WITH inputRow
WHERE toInteger(inputRow.`Treats`) = 1
MATCH (drug:Drug {primary: inputRow.drug})
MATCH (disease:Disease {primary: inputRow.disease})
CALL {
  WITH drug, disease, inputRow
  MERGE (drug)-[rel:TREATS]->(disease)
    ON CREATE SET rel.source = inputRow.source,
                  rel.inferenceScore = toFloat(inputRow.inferenceScore)
} IN TRANSACTIONS OF 1000 ROWS
In the above, the LOAD CSV simply starts streaming the file contents row by row to the rest of the clauses in the query. As each row is processed, it is first assigned as a map to the variable inputRow. For those who have been around Neo4j for a while, the older “USING PERIODIC COMMIT” prefix clause to LOAD CSV has been deprecated in favor of the more utilitarian CALL {} IN TRANSACTIONS.
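The same template is then repeated once per relationship class – for example, for the Palliates flag (the URL stays elided as in the example above; the column and key property names follow the earlier examples):

```cypher
LOAD CSV WITH HEADERS FROM '<url or filepathname>' AS inputRow
WITH inputRow
WHERE toInteger(inputRow.`Palliates`) = 1
MATCH (drug:Drug {primary: inputRow.drug})
MATCH (disease:Disease {primary: inputRow.disease})
CALL {
  WITH drug, disease
  MERGE (drug)-[:PALLIATES]->(disease)
} IN TRANSACTIONS OF 1000 ROWS
```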
For those not wishing to do the coding – and it can get a bit tedious – there are several tools on the market to support loading data into Neo4j. The most basic of these is the Neo4j Import tool, which is currently available for Aura in the Workspace preview. It is the simplest and fastest way to get started – but it is also the slowest at loading, as it uses a single thread of execution with no parallelism. It looks like the below:
However, my go-to tool for any serious data loading is Apache Hop, which is available at https://hop.apache.org/. The reason I prefer it is that it is more of a classic ETL tool:
…and other features. An example screenshot for this situation (overloading relationships) would look like:
Effectively, it will execute all of the relationship creation in parallel (along with other tasks in the pipeline). It starts with reading in the CSV file – which is a simple push-button dialog:
From there, the data is passed to a simple in-memory sort step. I do this because I am batching the writes and running a lot in parallel – sorting at least minimizes the contention as much as I can. It then parallelizes the stream and sends a complete copy to each branch, where it flows to the filter step, which has a simple dialog such as:
…and finally to the Neo4j cypher dialog:
Like I said, Apache Hop does support a codeless dialog to do this – but I like to control the Cypher myself, especially when there might be (as there typically is) more complicated logic such as conditional clauses.
And there you have it – modeling discrete relationships and refactoring the graph to implement them, in support of easier data science execution.