Modeling Discrete Relationships in Neo4j
Graph Modeling
One of the comments on my earlier blog was about graph modeling – an area that has a tremendous impact on the performance of any database. The problem for a graph database is that the traditional types of denormalization – the vertical/horizontal schema changes we are all familiar with from RDBMSs – do not translate well, and can actually hinder graph data science.
I am a big fan of not throwing the “baby out with the bath water”, “re-inventing the wheel”, or whatever cliché you prefer for starting all over. Sometimes – yes – it is necessary. However, decades of data modeling techniques shouldn’t be ignored just to question whether a square tire is advisable because it increases the contact surface with the road; it also increases rolling resistance by many orders of magnitude. Since data sources for Neo4j often come from an RDBMS – even if via a data lake such as Databricks – it is tempting to use your existing ER model as a starting place and begin reversing out the RDBMS-specific tuning. While doable, we are so ingrained in the relational approach at this point that this often fails unless you are very careful – and it will likely take longer. Instead, the approach we suggest is the “whiteboard” model. There are several distinct advantages to this:
I like to use the ER model as a “check” – e.g., am I missing anything?
Graph Modeling Best-Practices
Obviously, a complete course on graph modeling would take a thick book – and some are out there – but I have my own list of graph modeling best practices:
I cover these in great detail in the 1-day modeling class that I teach as part of my field engineering tasks. In today’s blog, I will focus on #6 – “Use Discrete Relationships” – because it came up in a customer situation and is a very common technique that makes data science much easier.
Integrative Biomedical Knowledge Hub (iBKH)
I will use a public example from the Integrative Biomedical Knowledge Hub (iBKH) project available at https://github.com/wcm-wanglab/iBKH . A picture of the model is as follows:
Note that in my discussion, I am not implying anything is wrong with the model as they have published it, but showing how, based on requirements, one might “refactor” it to enable specific use cases. That use case: trying to find drugs that have similar relationships with diseases and whether they impact the same genes or contain the same molecules. It would be a common, simple question to see if drug efficacy is due to specific molecules or genes, and then whether a side effect could be avoided in cases where the genetic compositions of patients differ. Scary thought, but medicine has advanced so rapidly into genetic specifics that I can almost foresee the day a trip to the doctor’s has them looking at your “23andMe” DNA test results to see what to prescribe.
Drug -> Disease and Drug -> Gene Relationships
In order to work with the dataset, you have to download the nodes and relationships from https://wcm.box.com/s/gagu6yj2toyk4kirb6hpsb1qu4dm203p and https://wcm.box.com/s/dcq6lj4vxzs4rnxu6xx60ziwl62qrzyp respectively. In the relationships data, there is a single file “D_Di_res.csv” that contains the relationship definition between Drug and Disease. One of the things I tell people to do in my modeling class is to always open the file, look at the data, and try to figure out the modeling implications – not just the keys used to find the source and target nodes. In this case, opening the file would show something like the following:
Note that in their model schema, they only define a single relationship. However, looking at the data, we can see that there are different roles/specific classes of relationships available. In some cases, multiple types of relationships exist simultaneously. Yes, we could load this as is and add properties on the relationship for each of these – either as an integer (as is) or cast to a Boolean. But I think there is a better way. The reason is that in comparing drugs and diseases, I might want a better similarity score for drugs that actually “Prevent” the disease vs. those that just seem to be a “biomarker for disease progression” – or simply “Palliate” (reduce severity or symptoms but not the cause).
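The two modeling choices can be sketched as Cypher patterns (the property and relationship names here follow the flag columns seen in “D_Di_res.csv”; everything else is illustrative):

```cypher
// Flag-property model (as shipped): a single TREATS relationship
// carrying one 0/1 column per relationship class
(drug:Drug)-[:TREATS {Prevents: 1, Palliates: 0, Treats: 1}]->(disease:Disease)

// Discrete-relationship model: one relationship type per class,
// where the same node pair can carry several types simultaneously
(drug:Drug)-[:TREATS]->(disease:Disease)
(drug)-[:PREVENTS]->(disease)
```

The discrete form lets projections and traversals select a class by relationship type alone, with no property filtering.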
Similarly, if we look at the Drug to Gene relationship in the file D_G_res, we can see:
As you can see, there are similar specifics such as “Downregulates”, “Upregulates”, etc.
It might be that this was done following the classic many-to-many join table implementation that would allow an RDBMS to load this data – but for a graph database, we can implement each of these as discrete relationships…so the model schema gets a bit of refinement:
Normal Forms vs. Graph Modeling
The reason for wanting those discrete relationships has to do with the syntax of most Neo4j GDS graph projections and algorithm executions. The optimal form is the “native” projection, which takes the syntax of:
CALL gds.graph.project(
  graphName: String,
  nodeProjection: String or List or Map,
  relationshipProjection: String or List or Map,
  configuration: Map
) YIELD
  graphName: String,
  nodeProjection: Map,
  nodeCount: Integer,
  relationshipProjection: Map,
  relationshipCount: Integer,
  projectMillis: Integer
As you can tell, there is no real place for a “where clause” to filter on relationship properties. The map for a relationshipProjection does allow relationship properties for weights, but not for filtering – you could project a graph including all the properties and then project a subgraph from it, but that is a bit of a contortion. The alternative is to use a Cypher projection, which has the syntax:
CALL gds.graph.project(
    graphName: String,
    sourceNode: Node or Integer,
    targetNode: Node or Integer,
    dataConfig: Map,
    configuration: Map
) YIELD
    graphName: String,
    nodeCount: Integer,
    relationshipCount: Integer,
    projectMillis: Integer,
    configuration: Map
The way this works is that the clauses prior to this procedure call include query syntax to focus on the nodes in question – so it would be possible to do something like:
MATCH (drug:Drug)-[:TREATS {Prevents: 1}]->(disease:Disease)
CALL gds.graph.project('drugdisease', drug, disease, …)
This would work, but it has a few problems: you would need a few UNION ALLs to also include the Drug->Gene combinations, and adding weights gets more complicated – as does the syntax to call out the various combinations of relationship classes we would want vs. just the one. Think about the example that “Treats” and “Palliative” might have very low weights vs. “Prevents” or “Treatment/Therapy”.
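With discrete relationship types, the same selection collapses into a single native projection. Here is a hedged sketch – the graph name, type list, and weight values are illustrative, not from the iBKH project. Since the discrete relationships carry no stored weight property, GDS’s per-type defaultValue can serve as the weight for each class:

```cypher
// Select relationship classes by type; assign each class its own weight
// via defaultValue (used because no 'weight' property exists on the rels)
CALL gds.graph.project(
  'drugSimilarity',
  ['Drug', 'Disease', 'Gene'],
  {
    PREVENTS:      { properties: { weight: { defaultValue: 1.0 } } },
    TREATS:        { properties: { weight: { defaultValue: 0.5 } } },
    PALLIATES:     { properties: { weight: { defaultValue: 0.1 } } },
    UPREGULATES:   { properties: { weight: { defaultValue: 0.8 } } },
    DOWNREGULATES: { properties: { weight: { defaultValue: 0.8 } } }
  }
)
YIELD graphName, nodeCount, relationshipCount;
```

No filtering clause is needed – selecting the relationship types *is* the filter, and each class gets its own weight for downstream algorithms.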
This is not unique to graph modeling – in the RDBMS world, optimizations – most often denormalizations – involve reducing the “normal form” from 3rd NF to 2nd NF or lower. Optimizing a graph model is often the opposite – going from 2nd or 3rd NF to possibly BCNF, 4th NF, or 5th NF. For Labeled Property Graphs it mostly stops there, without digressing into the mess of 6th NF that RDF-based graph DBMSs require. It’s not exactly accurate to describe it that way, though, as the notion of different relationships between the same exact two tuples doesn’t exist in a relational model. But since a picture is worth a million words….
As you can see from the diagram, optimizations for an RDBMS tend to reduce normal form adherence, whereas with an LPG GDBMS, they tend to increase it. RDF-based GDBMSs tend to be in 6th NF simply by definition, with no real way to change that via optimization.
Now, knowing we want to create these discrete relationships, there are two ways we can do this – simply load the data as is and refactor the graph after the fact – or create the relationships during the load itself.
Creating the discrete relationships via refactoring
Let’s assume we already loaded the data – or we got it via a dump file with the data already loaded. To refactor the graph, we would simply need to execute a series of Cypher statements such as:
MATCH (drug:Drug)-[:TREATS {Palliates: 1}]->(disease:Disease)
CALL {
  WITH drug, disease
  MERGE (drug)-[:PALLIATES]->(disease)
} IN TRANSACTIONS OF 10000 ROWS
;

MATCH (drug:Drug)-[:TREATS {InhibitsCellGrowth: 1}]->(disease:Disease)
CALL {
  WITH drug, disease
  MERGE (drug)-[:INHIBITS_CELL_GROWTH]->(disease)
} IN TRANSACTIONS OF 10000 ROWS
;

MATCH (drug:Drug)-[:TREATS {Alleviates: 1}]->(disease:Disease)
CALL {
  WITH drug, disease
  MERGE (drug)-[:ALLEVIATES]->(disease)
} IN TRANSACTIONS OF 10000 ROWS
;

MATCH (drug:Drug)-[:TREATS {DiseaseProgressionBiomarker: 1}]->(disease:Disease)
CALL {
  WITH drug, disease
  MERGE (drug)-[:PROGRESSION_BIOMARKER]->(disease)
} IN TRANSACTIONS OF 10000 ROWS
;
…
Using the CALL {} IN TRANSACTIONS syntax allows us to batch the modifications into groups of transactions vs. running the refactoring as a single large transaction and possibly exceeding the heap memory available in Neo4j. Note that if you look at the data, every record seems to indicate that there is always a TREATS relationship – so we do not need to remove the original relationship when done. It can be thought of as a generalized relationship that can be used whenever you want to see any relationship between a drug and a disease, regardless of type.
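Writing one statement per relationship class gets tedious. A more generic sketch – assuming the APOC library is installed, and that the flag-to-type map below is extended to cover the actual columns in the data (both are assumptions on my part) – uses apoc.merge.relationship to derive the relationship type dynamically:

```cypher
// Map each 0/1 flag property on TREATS to its discrete relationship type.
// Extend this map to cover every flag column present in D_Di_res.csv.
:param typeMap => ({
  Palliates: 'PALLIATES',
  Alleviates: 'ALLEVIATES',
  InhibitsCellGrowth: 'INHIBITS_CELL_GROWTH',
  DiseaseProgressionBiomarker: 'PROGRESSION_BIOMARKER'
});

MATCH (drug:Drug)-[t:TREATS]->(disease:Disease)
CALL {
  WITH drug, t, disease
  // For every flag set to 1, merge the corresponding discrete relationship
  UNWIND [k IN keys($typeMap) WHERE t[k] = 1 | $typeMap[k]] AS relType
  CALL apoc.merge.relationship(drug, relType, {}, {}, disease) YIELD rel
  RETURN count(rel) AS created
} IN TRANSACTIONS OF 10000 ROWS
RETURN sum(created) AS relationshipsCreated;
```

One batched pass over the TREATS relationships then creates all the discrete types at once.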
Creating the discrete relationships during load
At its core, Neo4j has the CREATE clause, which can be used to create either nodes or relationships. In addition, Neo4j also supports the MERGE clause, which has ON CREATE SET and ON MATCH SET subclauses that allow you to create a node/relationship if it doesn’t exist – or update it if it does – with different actions for each case. I prefer using MERGE for both – however, if I absolutely know the database is empty, I might use CREATE. Both of these commands are useful for creating a single node/relationship.
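A minimal sketch of the two subclauses (the drug id and the property names here are illustrative, not from the iBKH files):

```cypher
MERGE (d:Drug {primary: 'some-drug-id'})
  ON CREATE SET d.firstLoaded = timestamp()  // runs only when the node is new
  ON MATCH  SET d.lastSeen    = timestamp()  // runs only when it already existed
RETURN d;
```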
However, we often are not creating just a single node. This is where the LOAD CSV clause comes in. Neo4j’s Cypher dialect is a “streaming” or “pipelined” language vs. a “statement” dialect such as SQL. As a result, really long queries can be created in Neo4j where the results of one clause simply feed into the next. So a common load scenario into Neo4j in this case might look like:
// Assume Drug and Disease are already loaded
LOAD CSV WITH HEADERS FROM '<url or filepathname>' AS inputRow
WITH inputRow
WHERE toInteger(inputRow.`Treats`) = 1
MATCH (drug:Drug {primary: inputRow.drug})
MATCH (disease:Disease {primary: inputRow.disease})
CALL {
  WITH drug, disease, inputRow
  MERGE (drug)-[rel:TREATS]->(disease)
    ON CREATE SET rel.source = inputRow.source,
                  rel.inferenceScore = toFloat(inputRow.inferenceScore)
} IN TRANSACTIONS OF 1000 ROWS
In the above, the LOAD CSV simply starts streaming the file contents row by row to the rest of the clauses in the query. As each row is processed, it is first assigned as a map to the variable inputRow. For those who have been around Neo4j for a while, the older “USING PERIODIC COMMIT” prefix clause to LOAD CSV has been deprecated in favor of the more utilitarian CALL {} IN TRANSACTIONS.
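The same template is then repeated once per relationship class – for example, for the Palliates flag (the URL stays elided as in the example above; the column and key property names follow the earlier examples):

```cypher
LOAD CSV WITH HEADERS FROM '<url or filepathname>' AS inputRow
WITH inputRow
WHERE toInteger(inputRow.`Palliates`) = 1
MATCH (drug:Drug {primary: inputRow.drug})
MATCH (disease:Disease {primary: inputRow.disease})
CALL {
  WITH drug, disease
  MERGE (drug)-[:PALLIATES]->(disease)
} IN TRANSACTIONS OF 1000 ROWS
```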
For those not wishing to do the coding – and it can get a bit tedious – there are several tools on the market to support loading data into Neo4j. The most basic of these is the Neo4j Import tool, which is currently available for Aura in the Workspace preview. It is the simplest and fastest way to get started – but it is also the slowest at loading, as it uses a single thread of execution with no parallelism. It looks like the below:
However, my go-to tool for any serious data loading is Apache Hop, which is available at https://hop.apache.org/. The reason I prefer it is that it is more of a classic ETL tool:
…and other features. An example screenshot for this situation (overloading relationships) would look like:
Effectively, it will execute all of the relationship creation in parallel (along with other tasks in the pipeline). It starts with reading in the CSV file – which is a simple push-button dialog:
From there, the data is passed to a simple in-memory sort step. I do this because I am batching the writes and running a lot in parallel – sorting at least minimizes the contention as much as I can. It then parallelizes the stream and sends a complete copy to each branch, where it flows to the filter step, which has a simple dialog such as:
…and finally to the Neo4j cypher dialog:
Like I said, Apache Hop does support a codeless dialog to do this – but I like to control the Cypher myself, especially when there might be (as there typically is) more complicated logic such as conditional clauses.
And there you have it – modeling discrete relationships and refactoring the graph to implement them, in support of easier data science execution.