Knowledge Graphs and Layers of Value, part 3
Image by you.com

This post is the last in a triptych about how to build Knowledge Graphs and the layers of value that they add to data and to applications.

The first post described the first steps in – and the first layer of value from – building a knowledge graph. These initial steps identified the key nodes or concepts in a domain – the concept nodes in the graph – and the various terms that express each, producing what I call a concept catalog. More technically, this is a simple method of entity resolution – which already adds value to data. The concept catalog includes two key types of concepts so far: those that refer to individual entities (like The Eiffel Tower, Lake Baikal, or Barack Obama) and those that denote categories (or categorical entities), which aggregate similar individuals (like buildings, hydrographic features, or politicians).
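
Here's a minimal sketch of a concept catalog in Python – the concept IDs, terms, and helper function are invented for illustration, not part of any standard:

    # A minimal concept catalog: each concept ID maps to the set of
    # surface terms that express it (one equivalence class per concept).
    concept_catalog = {
        "C001": {"kind": "individual", "terms": {"The Eiffel Tower", "Tour Eiffel"}},
        "C002": {"kind": "category", "terms": {"building", "buildings", "edifice"}},
    }

    # Invert the catalog for simple entity resolution: term -> concept ID.
    term_index = {
        term.lower(): concept_id
        for concept_id, entry in concept_catalog.items()
        for term in entry["terms"]
    }

    def resolve(term):
        """Return the concept ID for a surface term, if the catalog knows it."""
        return term_index.get(term.lower())

    print(resolve("Tour Eiffel"))  # -> C001
    print(resolve("edifice"))      # -> C002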

The second post described the next steps in – and the second layer of value from – building a knowledge graph. These steps identified the key conceptual relationships and characteristics in a domain and the various terms that express each. They'll be the edges or “predicates” that connect the nodes in the graph. The characteristic-type edge highlights an additional kind of node, as well: literal nodes (or "values") that refer to specific literal values like numbers, dates, times, or labels, which are not treated as concepts in most knowledge graphs. These edges (which denote relationships and characteristics) constitute a kind of language for describing facts, more technically a metalanguage for knowledge representation. At this stage, the concept catalog has two more kinds of concepts: relationships (between entities) and characteristics that link entities and literal values. Labeled property graphs are important here because they enable us to treat edges as real concepts that participate in relationships and have characteristics – unlike relational databases, for example.
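
As a rough sketch of what "edges as real concepts" buys us, here's one way to represent a labeled-property-graph edge in Python (the names and values are invented for illustration):

    from dataclasses import dataclass, field

    @dataclass
    class Edge:
        """A labeled-property-graph edge: a first-class object that can
        carry its own characteristics, unlike a bare foreign key."""
        source: str          # concept node ID
        predicate: str       # relationship or characteristic type
        target: str          # concept node ID, or a literal value
        properties: dict = field(default_factory=dict)

    # A relationship edge between two concept nodes...
    works_for = Edge("person:ada", "worksFor", "org:acme",
                     properties={"since": 2021, "confidence": 0.95})

    # ...and a characteristic edge linking a concept node to a literal.
    height = Edge("entity:eiffel_tower", "hasHeight", "330 m",
                  properties={"source": "survey", "asOf": "2024"})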

What's left to do?

Contrary to popular belief, even though we have defined (individual, category, and literal) nodes and (relationship vs characteristic) edges – the components of any graph – we still do not have enough information to call this a knowledge graph. We only have the makings of a text graph, which humans can easily transform into a knowledge graph but computers cannot. Yet.

The crux of the matter – and the key difference between text graphs and knowledge graphs – is our (temporary) assumption that:

A concept is an equivalence class of terms.

It makes great sense operationally to start with this assumption, since most of the knowledge we want to capture and store comes from text. But although sequences of characters are usually meaningful to us – the human developers and users of text-based systems – to our computers they are empty, inert, and arbitrary collections of characters with no meaning and no relation to the world. When an algorithm looks up a string like "monotreme" to find more information, all it finds is an array of character codes like [109, 111, 110, 111, 116, 114, 101, 109, 101], i.e., ["m", "o", "n", "o", "t", "r", "e", "m", "e"]. If the "concept" of dog is an equivalence class of terms like ["pooch", "cur", "mutt", "mongrel", "hound", "canine", "doggy", "puppy", …], it's still a collection of ASCII strings. We've just added a smidgen of semantics: that these terms are somehow equivalent. And although useful, that information is nowhere near enough to support reasoning and intelligence – or processing based on meaning rather than on spelling.
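
You can verify this emptiness in a few lines of Python:

    # To the machine, a term is just a sequence of character codes.
    print([ord(c) for c in "monotreme"])
    # [109, 111, 110, 111, 116, 114, 101, 109, 101]

    # An equivalence class of terms adds only set membership – nothing
    # here points at actual dogs.
    dog_terms = {"pooch", "cur", "mutt", "mongrel", "hound", "canine", "doggy", "puppy"}
    print("mutt" in dog_terms)  # True, and that's all the machine knows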

Concepts as collections of facts

The next step is by far the most powerful and most valuable one: it's moving away from a view of concepts as a collection of terms or strings to one where

Concepts are a collection – in fact a structured collection – of facts.

This step is what gives knowledge graphs their extraordinary superpowers for data aggregation, data integration, and taming language models.

A collection of facts gives the node that they share its meaning – in technical terms, its grounding – according to the kinds of facts that are available. When an entity node like cat is related to other well-defined concepts, as in cats subcategoryOf mammals, it gets a part of its meaning from the other concept. This is called conceptual grounding, a key part of abstract reasoning. When an entity node like cat is related to characteristics with specific values like cats haveAverageWeight 10 lbs or cats haveAverageLength 18 in, the concept gets a different part of its meaning through sensation or measurement. This part is called perceptual (or sensory) grounding, a key component of "embodied cognition". There are, of course, other kinds of grounding involved in concept formation, but these are by far the two most important ones. Knowledge graphs serve to systematically accumulate and document these components of meaning – the grounding – in a way that's both explicit and accessible to algorithms.

So knowledge graphs have this really neat, really crucial feature: their entity nodes are self-defining. Each entity node is defined and described by the collection of facts that include it – i.e., the subgraph or graph neighborhood around it. As we document and accumulate more facts about a particular entity, we enrich its description iteratively – we store more and more knowledge about it. This collection of entity-specific facts makes explicit the characteristics, components, and relationships that define what we mean by entity Q39792 – not only what we call it – so these facts constitute the semantics of the entity and the knowledge that we have stored about it.
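
A minimal sketch of this self-defining behavior (facts and helper invented for illustration) – the first two triples below give cat conceptual grounding, the next two perceptual grounding:

    # Facts as (subject, predicate, object) triples.
    facts = [
        ("cat", "subcategoryOf", "mammal"),
        ("cat", "subcategoryOf", "pet"),
        ("cat", "haveAverageWeight", "10 lbs"),
        ("cat", "haveAverageLength", "18 in"),
        ("mammal", "subcategoryOf", "animal"),
    ]

    def neighborhood(node):
        """An entity node is self-defining: its meaning is the subgraph
        of facts that mention it."""
        return [t for t in facts if node in (t[0], t[2])]

    for triple in neighborhood("cat"):
        print(triple)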

For example, the Wikidata concept of Jack Nicholson, the American actor, is summarized by the facts (or triples) below, as well as a fact like [Q39792 hasLabel "Jack Nicholson"]:
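
    [Q39792 instanceOf      Q5 ("human")]
    [Q39792 hasOccupation   Q33999 ("actor")]
    [Q39792 hasDateOfBirth  "1937-04-22"]
    [Q39792 hasCitizenship  Q30 ("United States of America")]

(The predicate names above are readable stand-ins for Wikidata's numeric property IDs; the values are from Wikidata's entry for Q39792.)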

With a knowledge graph, when an algorithm looks up or "unpacks" Jack Nicholson, instead of finding an array of ASCII characters, it retrieves a collection of facts about the real-world entity called Jack Nicholson rather than information about the string "Jack Nicholson". This is easy to check because both real-world and fictional entities have very different characteristics from strings. Real-world elephants do not have characters as parts or font color as a characteristic. Their height and length are measured in feet or meters, not in characters or pixels. Strings and entities are very different things.

Next step: Add more facts about your entities, more entities, more edges.

In the previous steps, we identified important entities as well as important relationships between them and their characteristics. With a standard node-edge-node format we can accumulate explicit facts about these entities in the form of triples.

You want the knowledge graph that you use in production to cover the most important facts for your use case: not every possible fact, but the ones that consistently describe (or are queried about) the entities in your domain.

Prioritization and relevance of facts for the same entity vary dramatically as a function of task and goals. For a movie knowledge graph, the facts about Jack Nicholson will emphasize his relationships with movies and industry awards. For a patient knowledge graph, the facts about Jack Nicholson will focus on treatments and symptoms like memory loss and social withdrawal. For a fan-driven celebrity knowledge graph, the facts will focus on idiosyncrasies like his trademark sunglasses, his never having done a talk show, or bloopers in his movies.

One key characteristic of knowledge graphs is that every node, edge, and triple is defined only once and in significantly more detail than just providing a label or a human-readable definition: a node or edge ID is a pointer to the collection of facts that describe or define it, not a string. So the facts or triples of the knowledge graph will be built using only the vocabulary of these well-defined nodes and edges, as well as any necessary literal values like labels, dates, and numbers. This process makes the components of each concept’s meaning machine-accessible, not only human-readable. Of course, this vocabulary is extensible, as well: the key is to avoid duplicate or redundant entities or relations.
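
One way to picture this "define once, build only from the vocabulary" discipline in Python (the IDs, definitions, and validation rule are a simplified invention):

    # Controlled vocabularies: every node and edge ID is defined exactly once.
    nodes = {"Q39792": "Jack Nicholson", "Q5": "human"}
    edges = {"instanceOf": "links an entity to a category it belongs to"}

    def add_fact(graph, s, p, o):
        """Accept a triple only if it uses the defined vocabulary.
        (By convention in this sketch, literal values are quoted strings.)"""
        if s not in nodes:
            raise ValueError(f"undefined subject: {s}")
        if p not in edges:
            raise ValueError(f"undefined predicate: {p}")
        if o not in nodes and not o.startswith('"'):
            raise ValueError(f"undefined object: {o}")
        graph.append((s, p, o))

    graph = []
    add_fact(graph, "Q39792", "instanceOf", "Q5")        # OK
    try:
        add_fact(graph, "Q39792", "instanceOf", "Q999")  # unknown node
    except ValueError as err:
        print(err)                                       # undefined object: Q999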

A new Layer of Value

The first phase of knowledge graph building (in my previous post) identifies the most important entities and the many terms used to identify each of them. When done carefully, this implements a single semantic relation: synonymOf. This yields dramatic data compression and adds value in the different ways described there.

The second phase (described in my second post) identifies the most important relationships and characteristics of entities – and aggregates the terms used to define them. This kind of data compression, based on adjectives and measures, implements a different semantic relation – instanceOf – but focuses on aggregating attributes rather than entities.

The current step adds yet another layer of value by leveraging the entities, relationships, and characteristics described before as a vocabulary for constructing the facts – in the form of knowledge graph triples – that describe and define the components of meaning for each concept. The relationships and characteristics each document their own semantic relations – conceptual components of meaning that enable not only dramatic data compression (thousands of phrases or sentences can reduce to a simple triple) but also principled methods of data integration across different sources – the “semantic layer” bridging silos – and knowledge-driven data aggregation across instances.

Building and accumulating knowledge graph triples makes two things explicit:

  • the internal structure of key concepts – their components and the relations between these components – as well as
  • the relations between the concepts themselves.

Access to components of meaning (instead of just labels for concepts) enables – and supercharges – meaning-based search, analytics, recommendations, and business operations generally.

For search and recommendations, indexing a knowledge graph rather than an ocean of sentences or word tokens yields a much smaller, more tractable, more transparent conceptual search space. This means better, faster searches and recommendations with less latency, less hardware, and more explainability.
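
A toy illustration of that smaller, more transparent search space (entity IDs and predicates invented for the example):

    from collections import defaultdict

    # A handful of triples instead of an ocean of sentences.
    triples = [
        ("jack_nicholson", "actedIn", "the_shining"),
        ("jack_nicholson", "wonAward", "academy_award_for_best_actor"),
        ("the_shining", "directedBy", "stanley_kubrick"),
    ]

    # Index every triple under both of the concepts it mentions.
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((s, p, o))
        index[o].append((s, p, o))

    # One exact lookup on a concept ID replaces scanning every sentence
    # that mentions "Nicholson", "Jack", "J. Nicholson", and so on.
    print(index["the_shining"])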

For machine learning, knowledge graph triples define a space with richer, more interpretable, semantically aggregated facts as features – not just simplistic, independent tags or extremely variable, ad hoc strings. Because these triple-based features are easier to understand, user behavior and feedback provide clearer, more effective guidance for refining models.
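
A small sketch of triples as interpretable features (again with invented names):

    # Each (predicate, object) pair becomes a named, inspectable feature –
    # not an opaque token count.
    triples = [
        ("jack_nicholson", "actedIn", "the_shining"),
        ("jack_nicholson", "wonAward", "academy_award_for_best_actor"),
    ]

    def features_for(entity):
        return {f"{p}={o}": 1 for s, p, o in triples if s == entity}

    print(features_for("jack_nicholson"))
    # {'actedIn=the_shining': 1, 'wonAward=academy_award_for_best_actor': 1}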

For analytics, knowledge graph triples enable counts over richer, interrelated facts instead of simple, independent features. This will help ensure the relevance, interpretability, and interoperability of the FAIR data that we need to create and share.

For asset management, we can move beyond standardized, independent tags to annotating – in fact, summarizing – video, audio, written, and computer-generated assets with explicit, unambiguous triples. As we enrich the sensory and linguistic diversity of the nodes and relations in the graph, cross-modal and cross-lingual search become more refined and more precise – regardless of the format, language, or modality of the original information.

For large language models, knowledge graphs implement conceptual guardrails that help avoid hallucinations, errors, and inappropriate outputs, as well as a mechanism for compact injection of relevant, time-sensitive information through RAG architectures. These next-gen knowledge-infused language models are evolving rapidly.
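
As a sketch of that compact injection – no particular LLM API is assumed, and the prompt template is invented for illustration:

    # Serialize an entity's triples into compact, grounded context for a prompt.
    triples = [
        ("Jack Nicholson", "hasDateOfBirth", "1937-04-22"),
        ("Jack Nicholson", "wonAward", "Academy Award for Best Actor"),
    ]

    def triples_as_context(entity, facts):
        lines = [f"{s} {p} {o}." for s, p, o in facts if s == entity]
        return "Known facts:\n" + "\n".join(lines)

    prompt = (
        triples_as_context("Jack Nicholson", triples)
        + "\n\nUsing only the facts above, answer: when was Jack Nicholson born?"
    )
    print(prompt)  # this string would be sent to the language model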

Knowledge graphs are already a very effective way to do more with less. Even when they’re not complete or coherent or carefully constructed. Building them at higher levels of scale and coherence will create many more layers of increasing value.

Knowledge-driven AI – Artificial Expertise – is the next frontier for research and product development.