
Buzzword Salad: Ontologies, Digital Twins, Knowledge Graphs….

Jeff Tallman

Principal Sales Engineer at Neo4j

Published: December 15, 2022

TL;DR: Why RDF databases don’t scale well for Knowledge Graphs

Let’s face it, technologists are worse at making up words than politicians or even weathermen…and that’s saying something. When I was growing up, we had “flurries”, “snow squall” or “blizzard” – none of this “bomb cyclone”, “polar vortex”…and some crazy word that describes the layering of precipitation that Google can’t even find unless you know the word. Don’t get me started on politicians – if they were weather forecasters, you’d still have to go outside to see what was really happening. But…yep, we in the IT space are by far the worst offenders. Three of my non-favorites have come up a lot recently…let’s define them. I had a lot of help from Wikipedia.

Ontology.

Definition: a set of concepts and categories in a subject area or domain that shows their properties and the relations between them
Translated to English: Terms that describe all the business entities, their attributes and relationships between the entities
CyberTrends: Typically implemented in RDF for ease of exchange, as RDF is a W3C standard for exchanging ontologies
In Reality: It’s a schema in meta-data.

Digital Twin.

Definition: A virtual representation that serves as the real-time digital counterpart of a physical object or process.
Translated to English: Everything you know about an entity from a business perspective
CyberTrends: The buzzword du jour – and often defined in ways that overcomplicate the reality
In Reality: The stuff you care about and everything you know about it.

Knowledge Graph.

Definition: A knowledge base that uses a graph-structured data model or topology to integrate data.
Translated to English: A graph-based data repository that is focused on a business domain or set of business problems on an aspect of the business
CyberTrends: Unfortunately, often distorted to mean an RDF graph database
In Reality: The intersection of Ontologies and Digital Twins

Now, that last one really bugs me. I don’t know how many times I’ve heard someone say “Yep – we have a knowledge graph, do you support RDF?” I used to think I was alone in this until yesterday’s amazing intro for the Neo4j Connections event on Life Sciences, where Dr. Jesus Barrasa essentially said the same thing: combining an ontology and digital twins is a path to a knowledge graph. At least, that’s how I dumbed down what he said. It was either a case of “great minds think alike” or “fools seldom differ”. But since he is a super brilliant guy, I’ll go with the former.

Which brings me back to RDF as a Knowledge Graph and why they fail. Well…actually, they succeed at very low scale – it is when you put lots of stuff in them that they fall apart. Let’s investigate why. If you implemented your RDBMS schema not as an ER model of the business, but as tables that model the ER model itself, you’d have an RDF database. First, understand that the Resource Description Framework (RDF) is a W3C standard model for data interchange on the Web. That means it is more like XML/XSD than a database. Databases that implement RDF are more accurately known as “Triple Stores”, as the physical storage is typically a key-value pair system with <subject><predicate><object> “triples” (think <object><relationship><object>) on disk. If you designed your RDBMS that way, it would be something like:

Figure: A simple RDF model of People and Cars

Now, the important thing is that the red entities actually don’t have any data – just a unique identifier (GUID). All the data values are in the blue entities. A real ER schema might look like:

And now you know the difference between an RDF triple store and a labeled property graph. Why is this a scalability concern? It all has to do with indexability. In a typical RDBMS, we all know that composite indices are very common. The most common index type is the B+ tree, in which the leaf nodes contain the index key values (the composite key values) and a page/row pointer to the row of data. Now this is important. For a labeled property graph, such as Neo4j, since properties are stored on the node, we can easily create a B+ tree index on composite keys. RDF has a fundamental problem: the implementations typically are not graph stores but key-value pair stores. Take Oracle’s graph implementation for example:

SQL> describe myGraphVT$
 Name                      Null?    Type
 ------------------------- -------- ----------------
 VID                       NOT NULL NUMBER
 K                                  NVARCHAR2(3100)
 T                                  NUMBER(38)
 V                                  NVARCHAR2(15000)
 VN                                 NUMBER
 VT                                 TIMESTAMP(6) WITH TIME ZONE
 SL                                 NUMBER
 VTS                                DATE
 VTE                                DATE
 FE                                 NVARCHAR2(4000)

SQL> describe myGraphGE$
 Name                     Null?    Type
 ------------------------ -------- ------------------
 EID                      NOT NULL NUMBER
 SVID                     NOT NULL NUMBER
 DVID                     NOT NULL NUMBER
 EL                                NVARCHAR2(3100)
 K                                 NVARCHAR2(3100)
 T                                 NUMBER(38)
 V                                 NVARCHAR2(15000)
 VN                                NUMBER
 VT                                TIMESTAMP(6) WITH TIME ZONE
 SL                                NUMBER
 VTS                               DATE
 VTE                               DATE
 FE                                NVARCHAR2(4000)

Cited from: https://docs.oracle.com/en/database/oracle/property-graph/20.4/spgdg/using-property-graphs-oracle-database.html#GUID-6146A453-671F-489C-AD72-2C86920A8494

Those "K" and "V" thingies…yep, Key and Value. Actually, the value is stored in different ways and formats (the V, VN and VT columns). EL is Edge Label…maybe.
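To make that key-value layout a bit more concrete, here is a minimal Python sketch of the same two entities stored both ways. Everything in it is made up for illustration (the identifiers, the property names, the helper functions); it is not how Oracle, Neo4j or any particular triple store lays out its pages, just the shape of the problem.

```python
# Hypothetical data: one person who owns one car.

# Triple-store style: one row per (subject, predicate, object) fact.
triples = [
    ("person:1", "rdf:type", "Person"),
    ("person:1", "name", "Alice"),
    ("person:1", "birth_year", 1980),
    ("person:1", "owns", "car:7"),
    ("car:7", "rdf:type", "Car"),
    ("car:7", "make", "Volvo"),
    ("car:7", "year", 2019),
]

# Property-graph style: the properties live on the node record itself.
nodes = {
    "person:1": {"label": "Person", "name": "Alice", "birth_year": 1980},
    "car:7": {"label": "Car", "make": "Volvo", "year": 2019},
}
relationships = [("person:1", "OWNS", "car:7")]


def triple_rows_for(subject, keys):
    """Return the separate triple rows that hold the requested properties."""
    return [(s, p, o) for s, p, o in triples if s == subject and p in keys]


def node_record_for(node_id, keys):
    """Return the requested properties from the single node record."""
    record = nodes[node_id]
    return {k: record[k] for k in keys}


def pivot(rows):
    """Rebuild per-entity records from triples, one row at a time."""
    records = {}
    for subject, predicate, obj in rows:
        records.setdefault(subject, {})[predicate] = obj
    return records
```

Asking for a car's make and year touches two separate rows in the triple layout and one record in the node layout. That difference is the whole point: an index entry on (make, year) has one place to point to in the second layout and two in the first.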

As a result, a composite index is a bit difficult to create, as the leaf node page/row pointer would have to point to 2 different rows in storage. So most common RDF implementations resort to the old trick of Sort Merge Joins (SMJ), à la this example query and query plan taken from Stardog’s documentation:

SELECT DISTINCT ?person ?name
WHERE {
  ?article rdf:type bench:Article .
  ?article dc:creator ?person .
  ?inproc rdf:type bench:Inproceedings .
  ?inproc dc:creator ?person .
  ?person foaf:name ?name
}

Distinct [#812K]
`─ Projection(?person, ?name) [#812K]
   `─ MergeJoin(?person) [#812K]
      +─ MergeJoin(?person) [#391K]
      │  +─ Sort(?person) [#391K]
      │  │  `─ MergeJoin(?article) [#391K]
      │  │     +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
      │  │     `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
      │  `─ Scan[PSOC](?person, foaf:name, ?name) [#433K]
      `─ Sort(?person) [#503K]
         `─ MergeJoin(?inproc) [#503K]
            +─ Scan[POSC](?inproc, rdf:type, bench:Inproceedings) [#255K]
            `─ Scan[PSOC](?inproc, dc:creator, ?person) [#898K]

Cited from: https://docs.stardog.com/archive/7.7.1/operating-stardog/database-administration/managing-query-performance.html

Now, if you’ve studied DBMS query optimization for very long, you shudder at the thought of SMJs, as the sort step makes them some of the slowest join techniques on the planet whenever the inputs don’t already arrive in join-key order. They also don’t work well on larger volumes, as you either consume a ton of memory for the sort…or more typically the sort “spills” to disk – which rrreeaaallllyyy slows it down.
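For anyone who hasn’t watched one of these in action, here is a minimal Python sketch of a sort merge join over two lists of (key, value) pairs. It is a textbook version written for illustration, not Stardog’s (or anyone else’s) actual implementation, and the function name and sample rows are my own assumptions. Note where the work goes: both inputs get sorted in full before a single result row comes out.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs on key using sort + merge."""
    # The sorts are the expensive part: each is O(n log n) and needs the whole input.
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])

    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    out.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return out


# Hypothetical rows shaped like the plan above: (person, article) and (person, name).
creators = [("p2", "a1"), ("p1", "a2"), ("p1", "a3")]
names = [("p1", "Ann"), ("p3", "Cy"), ("p2", "Bob")]
joined = sort_merge_join(creators, names)
```

With three rows a side, nobody cares. With hundreds of thousands of rows per scan, as in the plan above, those two sorts are where the memory goes.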

How important is composite key indexing? Consider that the most voluminous data in any database is not the business entities but the transactions. For example, banking transactions vs. bank accounts. Yes, each transaction likely does have a single unique transaction ID that I can look it up by very quickly. Nice demo. But what if I only know an account number and the transaction date? Without a composite index key, I have to search for the account number (tens of millions for global banks), then search for the transaction date (billions of transactions per day or week)…and do a Sort Merge Join.

Ready. Set. Go.

See you next year…..or spend a ton of money on hardware – take your pick.
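Here is the account-plus-date lookup as a toy Python sketch. The dictionaries stand in for indexes (a real B+ tree is ordered and lives on disk, which a dict is not), and the table, names and values are all hypothetical. What it shows is how many index entries each approach has to wade through to answer the same question.

```python
from datetime import date

# Hypothetical transactions: (txn_id, account, txn_date)
transactions = [
    (1, "ACC-1", date(2022, 12, 1)),
    (2, "ACC-1", date(2022, 12, 2)),
    (3, "ACC-2", date(2022, 12, 1)),
    (4, "ACC-2", date(2022, 12, 2)),
    (5, "ACC-1", date(2022, 12, 2)),
]

# Two single-key "indexes" and one composite "index".
by_account, by_date, by_account_and_date = {}, {}, {}
for txn_id, account, txn_date in transactions:
    by_account.setdefault(account, []).append(txn_id)
    by_date.setdefault(txn_date, []).append(txn_id)
    by_account_and_date.setdefault((account, txn_date), []).append(txn_id)


def lookup_single_keys(account, txn_date):
    """Fetch both candidate lists, then intersect them (the join step)."""
    account_hits = by_account.get(account, [])
    date_hits = by_date.get(txn_date, [])
    matches = sorted(set(account_hits) & set(date_hits))
    return matches, len(account_hits) + len(date_hits)


def lookup_composite(account, txn_date):
    """Go straight to the entries for this (account, date) pair."""
    matches = by_account_and_date.get((account, txn_date), [])
    return matches, len(matches)


single_matches, single_examined = lookup_single_keys("ACC-1", date(2022, 12, 2))
composite_matches, composite_examined = lookup_composite("ACC-1", date(2022, 12, 2))
```

Both lookups return the same transactions, but the single-key route examines every transaction for the account plus every transaction on that date, while the composite route examines only the matches. Now scale the date list up to a day's worth of transactions at a global bank.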

The net effect is that most RDF benchmarks, such as Stardog’s infamous “Trillion Edge Knowledge Graph” infomercial produced by McKnight Consulting Group (sponsored by Stardog itself, of course), were based strictly on point queries in which a single key index could be leveraged…or were passed through to an RDBMS (as a virtual graph) which supported composite key indices. Yay. Benchmarks – how to lie, cheat and steal. Kinda like statistics – you can make them prove anything you want.

But it also explains why Oracle only ever benchmarks PGX – which is more comparable to Neo4j’s GDS implementation of in-memory graph projections – in which any in-memory graph projection is a refactoring of the knowledge graph into monopartite and bipartite models. A true knowledge graph typically supports graph analytics queries involving deep multi-hop traversals across the graph, quite often using variable-length expressions such as (:Component)-[:IS_PART_OF*]->(:Assembly) – the asterisk implying any number of traversals, possibly 50 or more. Given how Oracle stores the graph data, I wouldn’t even want to try that…
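To show what that asterisk is asking for, here is a small Python sketch that walks IS_PART_OF edges breadth-first from one component and records how many hops away each ancestor is. The parts hierarchy and the function name are invented for illustration, and this is not how Neo4j executes a variable-length pattern internally; it is just the traversal logic spelled out.

```python
from collections import deque

# Hypothetical IS_PART_OF edges: child -> list of parents.
is_part_of = {
    "bolt": ["bracket"],
    "bracket": ["door"],
    "door": ["body"],
    "body": ["car"],
    "wheel": ["car"],
}


def ancestors(start):
    """Return {node: hops} for everything reachable via one or more IS_PART_OF hops."""
    hops_to = {}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        for parent in is_part_of.get(node, []):
            if parent not in hops_to:  # first visit is the shortest path
                hops_to[parent] = hops + 1
                queue.append((parent, hops + 1))
    return hops_to


bolt_ancestors = ancestors("bolt")
```

Each pass through that loop is one more hop. When following an edge is a pointer lookup, 50 hops is 50 cheap steps; when every hop is another join against a key-value table, it is 50 joins, which is why I wouldn’t want to try it.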

Getting back to a knowledge graph combining an ontology and digital twins – a digital twin likely means scads and scads of data. For example, if you are making a digital twin of a supply chain to support defect root cause analysis, you would need to create a digital twin not only for every end product (aka finished good) produced, but also the intermediate components – and the machine instances used to produce them – and the employees operating them (was it a lack of training that led to the defect?) – the packaging, the transportation (was the defect caused by falling off the truck?) – and so on.

Net, net: RDF for an ontology definition is fine…just don’t use it for a knowledge graph.
