Counterfeit Knowledge Graphs

When we progress from data to knowledge, there is what physicists call a phase change, like the change from water to ice or from mud to brick. The ingredients are the same throughout the transition, but we compress and restructure these ingredients into something entirely new with dramatically different characteristics. Like mud and brick, data and knowledge are not the same thing.

As I've argued elsewhere, data is the new mud. We can wallow in mud and easily make a huge mess with it, but we can't shape it and build complex structures with it. Large quantities of mud require significant effort and resources to gather, direct, and contain. In fact, mud is much more of a problem than a resource. After we transform mud into bricks and data into knowledge, though, it's easy to store, to transport, and to assemble into walls, arches, towers, and castles. Bricks are much more valuable than mud, even though the ingredients are the same. We need to focus more on transforming muddy data into solid bricks of knowledge. We need to move from Big Data to Big Knowledge.

But I continue to see lots of work on “knowledge graphs” that ignores the distinction between mud and bricks, between (text) data and knowledge, between strings and concepts. Many graph enthusiasts make the painfully naïve assumption that text (words and sentences) is the same thing – with the same characteristics – as knowledge. They re-format sentences as graphs of strings and try to sell the result as "knowledge".

It's as if they're printing Monopoly money and declaring themselves millionaires: they're building counterfeit knowledge graphs that kind of, almost look like the real thing.

One clear example is the recently released work from Microsoft on GraphRAG. How disappointing! The researchers promise that they can "automate the extraction of a rich knowledge graph from any collection of text documents" but they deliver little more than hierarchical clustering of term strings as a pre-processing step for string-based RAG. They conclude yet again what we've already known for decades: that indexing on relations (however vague) between terms helps to identify related text snippets. They over-promised and under-delivered: their knowledge graphs are counterfeit. Here's why.
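To make the critique concrete, here is a minimal, hypothetical sketch (not Microsoft's actual pipeline) of what string-level "relatedness" indexing amounts to: count term co-occurrence, then use co-occurring terms to retrieve "related" snippets. Every name in it (`docs`, `cooc`, `related_snippets`) is invented for illustration. Note that nothing in this code ever touches meaning, only strings.

```python
# Illustrative sketch of string-level relatedness indexing.
# Co-occurrence of surface strings stands in for "relations" between terms.

from collections import defaultdict
from itertools import combinations

docs = ["the hound chased the cat",
        "a stray dog barked",
        "the cat slept"]

# Count how often each pair of terms appears in the same snippet.
cooc = defaultdict(int)
for d in docs:
    for a, b in combinations(sorted(set(d.split())), 2):
        cooc[(a, b)] += 1

def related_snippets(term):
    """Return snippets containing any term that co-occurs with `term`."""
    neighbors = ({b for (a, b) in cooc if a == term} |
                 {a for (a, b) in cooc if b == term})
    return [d for d in docs if neighbors & set(d.split())]

# "hound" retrieves cat-snippets purely via shared strings --
# no node anywhere knows that a hound is a dog.
assert "the cat slept" in related_snippets("hound")
```

This kind of index is genuinely useful for retrieval, which is the author's point: it helps find related text, but it contains no concepts and so is not a knowledge graph.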

  • Knowledge graphs compress the vast diversity and redundancy of strings into unique concepts. Distinguishing nodes in knowledge graphs doesn't depend on spelling but on meaning – so nodes can't be simple strings. If there are distinct nodes in a graph for dog, pooch, mutt, stray, mongrel, hound, cur, man's best friend, and others, then it does not model unique concepts. Clustering strings is at best a crude approximation to this. So graphs with strings as nodes aren't knowledge graphs.
  • Knowledge graph nodes enable unpacking a concept to display and leverage information about the components and characteristics of things in the world. Nodes in knowledge graphs are pointers, not literal values like strings, so they point to and document meaning for our algorithms. Clusters and embeddings of strings like 'elephant' include no information about weights, heights, trunks, and tusks – only about other strings in sentence context.
  • Knowledge graph nodes are unique, not ad hoc agglomerations like Actors and Directors or Influencers and Entrepreneurs. Clusters of strings are notoriously difficult for humans to label and evaluate because the criteria for clustering (other strings in context) are so different from what humans use to delimit concepts (components and characteristics). Clusters of strings, then, only sometimes and approximately correspond to the categories that humans define and manipulate to build knowledge.
  • Knowledge graphs have varied types of relations. Hierarchical clustering only yields one single, fuzzy notion of "relatedness" between strings, one type of relation – an important but very small subset of what goes into a knowledge graph. Rich knowledge graphs require relations of different types between categories: partOf, usedFor, hasColor, etc. to document a range of facts about things in the world, as well as relations between individuals like spouseOf or createdBy. Clustering provides none of these.
  • Clustering and embeddings do no more than situate strings as somehow related to other strings. Knowledge graphs enable algorithms to interpret strings, i.e., to map strings to conceptualizations of things in the world.
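The properties listed above can be sketched in a few lines of code. This is a toy, hypothetical data structure (all identifiers such as `Concept`, `lexicon`, and `interpret` are invented for illustration), but it shows the essentials: many surface strings map to one concept node, the node is a pointer rather than a literal, and it carries unpackable properties and typed relations.

```python
# Toy sketch of concept nodes vs. string nodes in a knowledge graph.

from dataclasses import dataclass, field

@dataclass
class Concept:
    cid: str                                          # opaque id: a pointer, not a label
    properties: dict = field(default_factory=dict)    # components and characteristics
    relations: dict = field(default_factory=dict)     # typed edges: relation -> [cid, ...]

# One concept, many signifiers: synonyms collapse onto a single node.
lexicon = {s: "C:dog" for s in
           ["dog", "pooch", "mutt", "hound", "man's best friend"]}

graph = {
    "C:dog": Concept("C:dog",
                     properties={"legs": 4, "typicalWeightKg": (5, 60)},
                     relations={"subclassOf": ["C:mammal"],
                                "usedFor": ["C:companionship"]}),
    "C:mammal": Concept("C:mammal",
                        relations={"subclassOf": ["C:animal"]}),
}

def interpret(string):
    """Map a signifier to its concept node, which can then be unpacked."""
    cid = lexicon.get(string.lower())
    return graph[cid] if cid else None

# Every synonym resolves to the same concept, with typed relations attached.
assert interpret("pooch").cid == "C:dog"
assert "subclassOf" in interpret("hound").relations
```

The contrast with string clustering is exactly the one the bullets draw: here `pooch` and `hound` are not two nodes that happen to cluster together, but two signifiers of one concept whose properties and typed relations are explicit.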

It's easy to see how this conflation of strings and concepts can arise. Humans store knowledge in text: messages, emails, articles, reports, books, etc. The meaning of most words is seamlessly and effortlessly transparent to readers – at least in their own language. Everyday experience makes it seem like the transition from letters or sounds to meanings is so simple and predictable that they are basically the same thing. Add to that the fact that most engineers and AI enthusiasts haven't any training at all in the nuances and gory details of language or cognition – they're normally naïve, novice, non-practitioners in the language space. So of course they're not familiar with the most basic assumption in the study of signs and symbols: that symbols like words are always made up of two parts, a directly observable, tangible "signifier" (like strings, sounds, gestures, or graphics) and an unobservable, mental entity called the "signified" or the meaning (i.e., the concepts or ideas that we attach to signifiers when we interpret them). As a result of this yawning gap in engineers' training, the vast majority of today's "symbolic" processing consists of ignoring meaning and slinging around inert signifiers like strings and pixels. Useful? Undoubtedly. But symbolic? Not at all.

There is a broad and growing consensus that a focus on the rich, reliable knowledge graphs of the kind described here will enable us to move beyond the current reliance on very large, very variable, and very expensive data collections to a new generation of compact, transparent, and reliable AI systems fueled by conceptual knowledge.

To focus better and avoid counterfeit products of dubious value, we need to be very clear about what our knowledge graphs have to look like.
Allan M.

Javascript Developer, DeepRL, Prompt Engineering, Model Coercion

2 mo

A word without context is just a word

Daniel Lundin

Head of Operations at Ortelius, Transforming Data Complexity into Strategic Insights

3 mo

Am I interpreting too much into this, or are you stating that designing a knowledge graph is a form of art? And that any automatically derived form of knowledge graph (scraped/indexed) is counterfeit and possibly hindering the development of proper KGs?

Kingsley Uyi Idehen

Founder & CEO at OpenLink Software | Driving GenAI-Based Smart Agents | Harmonizing Disparate Data Spaces (Databases, Knowledge Bases/Graphs, and File System Documents)

4 mo

There’s a serious issue at stake regarding the generic use of the term “Knowledge Graph,” which cloaks a lot of data access, integration, and management fundamentals. For instance, on a use case basis, the case for ontologies often ends up being shrouded in debates that are generally confusing, while the “Stringy Knowledge Graphs” (or Labelled Property Graphs [LPGs]) tend to focus on recommendations and the use of graph [actually network] analytics-oriented algorithms, which appear to be less confusing. When these superficial memes start circulating, we need to respond with simple, and more importantly, demonstrable rebuttals. For instance, ask about solving for recommendations deductively offered to cousins, nephews, siblings, uncles, aunties, co-workers, employers, etc. That example is generally easy for an audience to understand and is simply unanswerable by any LPG or conventional Relational Database Management System (RDBMS) due to the underlying needs for rules, reasoning, and inference, informed by an ontology comprising fine-grained machine-computable entity relationship type semantics. Even simpler, ask about explicit or implicit coreference re solving for diverse entity naming. #DBMS #KnowledgeGraph #Issues
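The commenter's kinship example can be sketched in a few lines. This is an illustrative toy only (the facts, the rule, and names like `parent_of` and `cousins` are invented here): a rule deductively derives "cousin" pairs from stored `parentOf` facts, the kind of inference that a plain label-and-edge lookup does not provide without a reasoning layer.

```python
# Toy deductive inference over kinship facts.
# Rule: cousin(X, Y) <- parent(P, X), parent(Q, Y), sibling(P, Q).

parent_of = {("alice", "carol"), ("bob", "dave"),
             ("grandma", "alice"), ("grandma", "bob")}

def siblings(x, y):
    """x and y are distinct and share a parent."""
    return x != y and any(
        (p, x) in parent_of and (p, y) in parent_of
        for p, _ in parent_of)

def cousins(x, y):
    """Derive cousinhood from parentOf facts; no cousin edge is stored."""
    return any(
        (p, x) in parent_of and (q, y) in parent_of and siblings(p, q)
        for p, _ in parent_of
        for q, _ in parent_of)

# carol and dave are cousins because their parents (alice, bob) are siblings,
# even though no "cousinOf" edge exists anywhere in the data.
assert cousins("carol", "dave")
assert not cousins("carol", "carol")
```

In an ontology-backed system this rule would live alongside the schema and apply to any matching individuals; here it is hard-coded purely to make the deduction visible.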

Kingsley Uyi Idehen

Founder & CEO at OpenLink Software | Driving GenAI-Based Smart Agents | Harmonizing Disparate Data Spaces (Databases, Knowledge Bases/Graphs, and File System Documents)

4 mo

Late to the party on this one. Great post!

Paul Oude Luttighuis

Architect van maatschappelijke informatiestelsels at Le Blanc Advies

4 mo

There is no progress from data to knowledge, only reduction of knowledge to data. Knowledge is the postponement of data.
