Counterfeit Knowledge Graphs
When we progress from data to knowledge, there is what physicists call a phase change, like the change from water to ice or from mud to brick. The ingredients are the same throughout the transition, but we compress and restructure these ingredients into something entirely new with dramatically different characteristics. Like mud and brick, data and knowledge are not the same thing.
As I've argued elsewhere, data is the new mud. We can wallow in mud and easily make a huge mess with it. But we can't shape it and build complex structures with it. Large quantities of mud require significant effort and resources to gather, direct, and contain. In fact, mud is much more of a problem than a resource. After we transform mud into bricks and data into knowledge, though, it's easy to store, to transport, and to assemble into walls, arches, towers, and castles. Bricks are much more valuable than mud, even though the ingredients are the same. We need to focus more on transforming muddy data into solid bricks of knowledge. We need to move from Big Data to Big Knowledge.
But I continue to see lots of work on "knowledge graphs" that ignores the distinction between mud and bricks, between (text) data and knowledge, between strings and concepts. Many graph enthusiasts make the painfully naïve assumption that text (words and sentences) is the same thing – with the same characteristics – as knowledge. They re-format sentences as graphs of strings and try to sell the result as "knowledge".
It's as if they're printing Monopoly money and declaring themselves millionaires: they're building counterfeit knowledge graphs that kind of, almost look like the real thing.
One clear example is the recently released work from Microsoft on GraphRAG. How disappointing! The researchers promise that they can "automate the extraction of a rich knowledge graph from any collection of text documents" but they deliver little more than hierarchical clustering of term strings as a pre-processing step for string-based RAG. They conclude yet again what we've already known for decades: that indexing on relations (however vague) between terms helps to identify related text snippets. They over-promised and under-delivered: their knowledge graphs are counterfeit. Here's why.
It's easy to see how this conflation of strings and concepts can arise. Humans store knowledge in text: messages, emails, articles, reports, books, etc. The meaning of most words is seamlessly and effortlessly transparent to readers – at least in their own language. Everyday experience makes it seem like the transition from letters or sounds to meanings is so simple and predictable that they are basically the same thing. Add to that the fact that most engineers and AI enthusiasts have no training at all in the nuances and gory details of language or cognition – they're normally naïve, novice non-practitioners in the language space. So of course they're not familiar with the most basic assumption in the study of signs and symbols: that symbols like words are always made up of two parts, a directly observable, tangible "signifier" (like strings, sounds, gestures, or graphics) and an unobservable, mental entity called the "signified" or the meaning (i.e., the concepts or ideas that we attach to signifiers when we interpret them). As a result of this yawning gap in engineers' training, the vast majority of today's "symbolic" processing consists of ignoring meaning and slinging around inert signifiers like strings and pixels. Useful? Undoubtedly. But symbolic? Not at all.
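To make the signifier/signified split concrete, here is a minimal sketch, assuming a tiny hand-made sense inventory. Every name in it (Concept, string_edges, senses, the two helper functions) is hypothetical and chosen for illustration; it is not taken from GraphRAG or from any graph library. It shows only that indexing on the string "bank" merges two unrelated meanings and misses a synonym, while indexing on a concept identifier does neither.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """The 'signified': a meaning with its own identity, independent of any label."""
    concept_id: str
    gloss: str


# Hypothetical concept inventory (an assumption made for this sketch).
FINANCIAL_BANK = Concept("C1", "institution that accepts deposits")
RIVER_BANK = Concept("C2", "land alongside a river")

# Triples as they might be pulled from text: every node is just a string (a signifier).
string_edges = [
    ("bank", "approves", "loan"),
    ("bank", "borders", "river"),
    ("financial institution", "holds", "deposits"),
]

# Hand-assigned senses for each subject occurrence, keyed by (string, edge index).
# In practice this disambiguation step is the hard part; here it is simply assumed.
senses = {
    ("bank", 0): FINANCIAL_BANK,
    ("bank", 1): RIVER_BANK,
    ("financial institution", 2): FINANCIAL_BANK,
}


def neighbours_by_string(edges, label):
    """Everything attached to a given subject string, regardless of meaning."""
    return {obj for subj, _, obj in edges if subj == label}


def neighbours_by_concept(edges, senses, concept):
    """Everything attached to a given concept, regardless of which string named it."""
    return {
        obj
        for i, (subj, _, obj) in enumerate(edges)
        if senses[(subj, i)] == concept
    }


# The string node "bank" mixes two meanings and misses the synonym.
print(sorted(neighbours_by_string(string_edges, "bank")))  # ['loan', 'river']

# The concept node separates the senses and unifies the synonym.
print(sorted(neighbours_by_concept(string_edges, senses, FINANCIAL_BANK)))  # ['deposits', 'loan']
```

For brevity the objects of each triple are left as plain strings; a fuller treatment would map them to concepts as well.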
There is a broad and growing consensus that a focus on the rich, reliable knowledge graphs of the kind described here will enable us to move beyond the current reliance on very large, very variable, and very expensive data collections to a new generation of compact, transparent, and reliable AI systems fueled by conceptual knowledge.
To focus better and avoid counterfeit products of dubious value, we need to be very clear about what our knowledge graphs have to look like.
Javascript Developer, DeepRL, Prompt Engineering, Model Coercion
2 months ago: A word without context is just a word.
Head of Operations at Ortelius, Transforming Data Complexity into Strategic Insights
3 months ago: Am I reading too much into this, or are you stating that designing a knowledge graph is a form of art? And that any automatically derived form of knowledge graph (scraped/indexed) is counterfeit and possibly hindering the development of proper KGs?
Founder & CEO at OpenLink Software | Driving GenAI-Based Smart Agents | Harmonizing Disparate Data Spaces (Databases, Knowledge Bases/Graphs, and File System Documents)
4 months ago: There’s a serious issue at stake regarding the generic use of the term “Knowledge Graph,” which cloaks a lot of data access, integration, and management fundamentals. For instance, on a use case basis, the case for ontologies often ends up being shrouded in debates that are generally confusing, while the “Stringy Knowledge Graphs” (or Labelled Property Graphs [LPGs]) tend to focus on recommendations and the use of graph [actually network] analytics-oriented algorithms, which appear to be less confusing. When these superficial memes start circulating, we need to respond with simple, and more importantly, demonstrable rebuttals. For instance, ask about solving for recommendations deductively offered to cousins, nephews, siblings, uncles, aunties, co-workers, employers, etc. That example is generally easy for an audience to understand and is simply unanswerable by any LPG or conventional Relational Database Management System (RDBMS) due to the underlying needs for rules, reasoning, and inference, informed by an ontology comprising fine-grained machine-computable entity relationship type semantics. Even simpler, ask about explicit or implicit coreference regarding solving for diverse entity naming. #DBMS #KnowledgeGraph #Issues
Founder & CEO at OpenLink Software | Driving GenAI-Based Smart Agents | Harmonizing Disparate Data Spaces (Databases, Knowledge Bases/Graphs, and File System Documents)
4 months ago: Late to the party on this one. Great post.
Architect van maatschappelijke informatiestelsels at Le Blanc Advies
4 months ago: There is no progress from data to knowledge, only reduction of knowledge to data. Knowledge is the postponement of data.