Buzzword Salad: Ontologies, Digital Twins, Knowledge Graphs….

TL;DR: Why RDF databases don’t scale well for Knowledge Graphs

Let’s face it, technologists are worse at making up words than politicians or even weathermen... and that’s saying something. When I was growing up, we had “flurries”, “snow squall” or “blizzard”, none of this “bomb cyclone”, “polar vortex”... and some crazy word for the layering of precipitation that Google can’t even find unless you already know the word. Don’t get me started on politicians: if they were weather forecasters, you’d still have to go outside to see what was really happening. But... yep, we in the IT space are by far the worst offenders. Three of my non-favorites have come up a lot recently, so let’s define them. I had a lot of help from Wikipedia.

Ontology.

  • Definition: a set of concepts and categories in a subject area or domain that shows their properties and the relations between them
  • Translated to English: Terms that describe all the business entities, their attributes and relationships between the entities
  • CyberTrends: Typically implemented in RDF due to ease of exchange, as RDF is a W3C standard for exchanging ontologies
  • In Reality: It’s a schema in meta-data.

Digital Twin.

  • Definition: A virtual representation that serves as the real-time digital counterpart of a physical object or process.
  • Translated to English: Everything you know about an entity from a business perspective
  • CyberTrends: The buzzword du jour, and often defined in ways that overcomplicate the reality
  • In Reality: The stuff you care about and everything you know about it.

Knowledge Graph.

  • Definition: A knowledge base that uses a graph-structured data model or topology to integrate data.
  • Translated to English: A graph-based data repository that is focused on a business domain or set of business problems on an aspect of the business
  • CyberTrends: Unfortunately, often distorted to mean an RDF graph database
  • In Reality: The intersection of Ontologies and Digital Twins

Now, that last one really bugs me. I don’t know how many times I’ve heard someone say “Yep, we have a knowledge graph, do you support RDF?” I used to think I was alone in this, until yesterday’s amazing intro for the Neo4j Connections event on Life Sciences, where Dr. Jesus Barrasa essentially said the same thing: combining an ontology and digital twins is a path to a knowledge graph. At least, that’s how I dumbed down what he said. It was either a case of “great minds thinking alike” or “fools seldom differ”. But since he is a super brilliant guy, I’ll go with the former.

Which brings me back to RDF as a Knowledge Graph and why RDF databases fail. Well... actually, they succeed at very low scale. It is when you put lots of stuff in them that they fall apart. Let’s investigate why. If you built your RDBMS schema not from an ER model of the business, but as a model of the ER model itself, you’d have an RDF database. First, understand that the Resource Description Framework (RDF) is a W3C standard model for data interchange on the Web. That means it is more like XML and XSD than a database. Databases that implement RDF are more accurately known as “Triple Stores”, as the physical storage is typically a key-value pair system with <subject><predicate><object> “triples” on disk (think entity, relationship, then another entity or a value). If you designed your RDBMS that way, it would be something like:

A simple RDF model of People and Cars

Now, the important thing is that the red entities actually don’t have any data, just a unique identifier (GUID). All the data values are in the blue entities. A real ER schema might look like:

A simple ER model of People and Cars
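To make the two pictures concrete, here is a minimal sketch of the same facts in both shapes. The sample person, car, and property names are made up for illustration and are not taken from the diagrams. The point is only that in the triple layout every value is its own row hanging off a bare identifier, while in the ER / property layout the values live on the entity.

```python
# Hypothetical sample data: one person who owns one car.

# Triple-store style: every fact is its own (subject, predicate, object) row.
# The subjects are bare identifiers; all values hang off them as extra rows.
triples = [
    ("person:1", "rdf:type", "Person"),
    ("person:1", "name", "Alice"),
    ("person:1", "birth_year", 1980),
    ("car:7", "rdf:type", "Car"),
    ("car:7", "make", "Volvo"),
    ("car:7", "model_year", 2019),
    ("person:1", "owns", "car:7"),
]

# ER / property-graph style: the values are stored on the entity itself,
# and only the relationship is a separate record.
nodes = {
    "person:1": {"label": "Person", "name": "Alice", "birth_year": 1980},
    "car:7": {"label": "Car", "make": "Volvo", "model_year": 2019},
}
edges = [("person:1", "OWNS", "car:7")]


def properties_from_triples(subject, all_triples):
    """Reassemble one entity by scanning for every triple with this subject."""
    return {p: o for s, p, o in all_triples if s == subject}


person_from_triples = properties_from_triples("person:1", triples)
person_from_node = nodes["person:1"]  # a single lookup, values already together
```

Seven rows versus two records and one edge. Reading one entity back out of the triples means gathering several rows, which is exactly what makes indexing on more than one property awkward.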

And now you know the difference between an RDF triple store and a labeled property graph. Why is this a scalability concern? It all has to do with indexability. In a typical RDBMS, we all know that composite indices are very common. The most common index type is the B+ tree, in which the leaf nodes contain the index key values (the composite key values) and a page/row pointer to the row of data. Now this is important. For a labeled property graph, such as Neo4j, since properties are stored on the node, we can easily create a B+ tree index on composite keys. RDF stores have a fundamental problem: they typically are not graph stores but key-value pair stores. Take Oracle’s graph implementation for example:

SQL> describe myGraphVT$
 Name      Null?    Type
 --------- -------- ----------------------------
 VID       NOT NULL NUMBER
 K                  NVARCHAR2(3100)
 T                  NUMBER(38)
 V                  NVARCHAR2(15000)
 VN                 NUMBER
 VT                 TIMESTAMP(6) WITH TIME ZONE
 SL                 NUMBER
 VTS                DATE
 VTE                DATE
 FE                 NVARCHAR2(4000)

SQL> describe myGraphGE$
 Name      Null?    Type
 --------- -------- ----------------------------
 EID       NOT NULL NUMBER
 SVID      NOT NULL NUMBER
 DVID      NOT NULL NUMBER
 EL                 NVARCHAR2(3100)
 K                  NVARCHAR2(3100)
 T                  NUMBER(38)
 V                  NVARCHAR2(15000)
 VN                 NUMBER
 VT                 TIMESTAMP(6) WITH TIME ZONE
 SL                 NUMBER
 VTS                DATE
 VTE                DATE
 FE                 NVARCHAR2(4000)

Cited from: https://docs.oracle.com/en/database/oracle/property-graph/20.4/spgdg/using-property-graphs-oracle-database.html#GUID-6146A453-671F-489C-AD72-2C86920A8494

Those "K" and "V" thingies... yep, Key and Value. Actually, the Value is stored in different ways and formats (see the V, VN and VT columns). EL is Edge Label... maybe.

As a result, a composite index is a bit difficult to create, as the leaf node page/row pointer would have to point to 2 different rows in storage. So most common RDF implementations resort to the old trick of Sort Merge Joins (SMJ), à la this example query and query plan taken from Stardog’s documentation:

SELECT DISTINCT ?person ?name
WHERE {
  ?article rdf:type bench:Article .
  ?article dc:creator ?person .
  ?inproc rdf:type bench:Inproceedings .
  ?inproc dc:creator ?person .
  ?person foaf:name ?name
}

Distinct [#812K]
`─ Projection(?person, ?name) [#812K]
   `─ MergeJoin(?person) [#812K]
      +─ MergeJoin(?person) [#391K]
      │  +─ Sort(?person) [#391K]
      │  │  `─ MergeJoin(?article) [#391K]
      │  │     +─ Scan[POSC](?article, rdf:type, bench:Article) [#208K]
      │  │     `─ Scan[PSOC](?article, dc:creator, ?person) [#898K]
      │  `─ Scan[PSOC](?person, foaf:name, ?name) [#433K]
      `─ Sort(?person) [#503K]
         `─ MergeJoin(?inproc) [#503K]
            +─ Scan[POSC](?inproc, rdf:type, bench:Inproceedings) [#255K]
            `─ Scan[PSOC](?inproc, dc:creator, ?person) [#898K]

Cited from: https://docs.stardog.com/archive/7.7.1/operating-stardog/database-administration/managing-query-performance.html

Now, if you’ve studied DBMS query optimization for very long, you shudder at the thought of SMJs, as they are some of the slowest join techniques on the planet. They also don’t work well on larger volumes, as you either consume a ton of memory for the sort... or, more typically, the sort “spills” to disk, which rrreeaaallllyyy slows it down.
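For anyone who hasn’t had the pleasure, here is a rough sketch of what a sort merge join does, in plain Python. It is a textbook illustration, not Stardog’s implementation, and the (person, article) and (person, name) rows are made-up stand-ins for the dc:creator and foaf:name scans in the plan above. Note that both inputs have to be fully sorted on the join key before a single row comes out, and that sort is the part that eats memory or spills.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs on key using sort + merge."""
    # Step 1: sort both inputs on the join key (the expensive part at scale).
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])

    joined = []
    i = j = 0
    # Step 2: walk both sorted inputs in lockstep.
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Emit every pairing within the two runs.
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    joined.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return joined


# Hypothetical rows: (person, article) and (person, name).
article_creators = [("p2", "article9"), ("p1", "article3"), ("p1", "article4")]
person_names = [("p1", "Ann"), ("p3", "Cy"), ("p2", "Bob")]

result = sort_merge_join(article_creators, person_names)
```

With three rows a side nobody cares. With the hundreds of thousands of rows per scan shown in the plan, each Sort step is where the time goes.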

How important is composite key indexing? Consider that the most voluminous data in any database is not the business entities but the transactions. For example, banking transactions vs. bank accounts. Yes, each transaction likely does have a single unique transaction ID that I can look it up by very quickly. Nice demo. But what if I only know an account number and the transaction date? Without a composite index key, I have to search for the account number (tens of millions for global banks), separately search for the transaction date (billions of transactions per day or week)... and then do a Sort Merge Join between the two results.

Ready. Set. Go.

See you next year... or spend a ton of money on hardware. Take your pick.
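Here is a small sketch of the account-plus-date lookup in both layouts, using SQLite from the Python standard library purely as a stand-in. The table names, index names, and sample rows are all hypothetical. SQLite will not pick a merge join for the key-value version, so this shows only the shape of the two queries, not the performance of any RDF product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Row layout: both lookup attributes sit on the same row, so one
# composite index entry points straight at the transaction.
cur.execute(
    "CREATE TABLE txn (txn_id INTEGER PRIMARY KEY, account TEXT, "
    "txn_date TEXT, amount REAL)"
)
cur.execute("CREATE INDEX idx_acct_date ON txn (account, txn_date)")

# Key-value layout: one row per attribute, so account and date for the
# same transaction live in two different rows.
cur.execute("CREATE TABLE txn_kv (txn_id INTEGER, k TEXT, v TEXT)")
cur.execute("CREATE INDEX idx_kv ON txn_kv (k, v)")

rows = [
    (1, "ACC-1", "2022-01-05", 10.0),
    (2, "ACC-1", "2022-01-06", 20.0),
    (3, "ACC-2", "2022-01-05", 30.0),
]
cur.executemany("INSERT INTO txn VALUES (?, ?, ?, ?)", rows)
for txn_id, account, txn_date, amount in rows:
    cur.executemany(
        "INSERT INTO txn_kv VALUES (?, ?, ?)",
        [
            (txn_id, "account", account),
            (txn_id, "txn_date", txn_date),
            (txn_id, "amount", str(amount)),
        ],
    )

lookup = ("ACC-1", "2022-01-05")

# Row layout: a single probe of the composite index.
row_sql = "SELECT txn_id FROM txn WHERE account = ? AND txn_date = ?"
row_hits = cur.execute(row_sql, lookup).fetchall()

# Key-value layout: find account rows, find date rows, then join them.
kv_sql = (
    "SELECT a.txn_id FROM txn_kv a JOIN txn_kv d ON a.txn_id = d.txn_id "
    "WHERE a.k = 'account' AND a.v = ? AND d.k = 'txn_date' AND d.v = ?"
)
kv_hits = cur.execute(kv_sql, lookup).fetchall()

# Ask SQLite how it answers the row-layout query; the last column of
# each plan row is the human-readable detail text.
plan = cur.execute("EXPLAIN QUERY PLAN " + row_sql, lookup).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)
```

Both queries return the same transaction, and `plan_text` names `idx_acct_date` for the row layout. The key-value version has no single index entry to land on, so it has to find two sets of rows and join them, which is the step that hurts when each set is millions or billions of rows.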

The net effect is that most RDF benchmarks, such as Stardog’s infamous “Trillion Edge Knowledge Graph” infomercial produced by McKnight Consulting Group (sponsored by Stardog itself, of course), were based strictly on point queries in which a single key index could be leveraged... or the query was passed through to an RDBMS (as a virtual graph) which supported composite key indices. Yay. Benchmarks: how to lie, cheat and steal. Kinda like statistics, you can make them prove anything you want.

But it also explains why Oracle only ever benchmarks PGX, which is more comparable to Neo4j’s GDS implementation of in-memory graph projections, in which any in-memory graph projection is a refactoring of the knowledge graph into monopartite and bipartite models. A true knowledge graph typically supports graph analytics queries involving deep multi-hop traversals across the graph, quite often using variable-length expressions such as (:Component)-[:IS_PART_OF*]->(:Assembly), where the asterisk implies any number of traversals, possibly 50 or more. Given how Oracle stores the graph data, I wouldn’t even want to try that...
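If the asterisk is new to you, here is a rough sketch of what a variable-length IS_PART_OF traversal asks for: every node reachable in one or more hops. It is written as a breadth-first walk over a made-up bill of materials. The part names are hypothetical and the :Component / :Assembly label checks are left out to keep it short. This is not how Neo4j or any other engine executes the pattern, just what the pattern means.

```python
from collections import deque

# Hypothetical bill of materials: each part maps to what it IS_PART_OF.
is_part_of = {
    "bolt": ["bracket"],
    "bracket": ["door"],
    "door": ["car_body"],
    "car_body": ["car"],
    "car": [],
}


def reachable_via(start, edges):
    """Return {node: hops} for everything reachable from start in 1+ hops."""
    hops_to = {}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        for parent in edges.get(node, []):
            if parent not in hops_to:  # skip nodes already reached
                hops_to[parent] = hops + 1
                queue.append((parent, hops + 1))
    return hops_to


assemblies = reachable_via("bolt", is_part_of)
```

Every hop here is a dictionary lookup on the node you are standing on. In a key-value table layout each hop is another lookup-and-join against the edge table, and 50 hops means doing that 50 times.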

Getting back to a knowledge graph combining an ontology and digital twins: a digital twin likely means scads and scads of data. For example, if you are making a digital twin of a supply chain to support defect root cause analysis, you would need to create a digital twin not only for every end product (aka finished good) produced, but also for the intermediate components, the machine instances used to produce them, the employees operating them (was it a lack of training that led to the defect?), the packaging, the transportation (was the defect caused by falling off the truck?), and so on.

Net, net: RDF for an ontology definition is fine... just don’t use it for a knowledge graph.

