Superpowers of Knowledge Graphs, part 1: Data Integration
There is plenty of writing about the use cases of knowledge graphs across different domains, but I don't see many explanations of why knowledge graphs actually work in these cases.
The answer is simple: knowledge graphs have superpowers above and beyond those of ontologies, taxonomies, and mere databases. Knowledge graphs can't leap tall buildings in a single bound, but they do have something akin to x-ray vision and other powers.
One key superpower of knowledge graphs is data integration. (See other superpowers like data aggregation and taming LLMs.)
Imagine that you have a collection of structured knowledge resources – databases, spreadsheets, taxonomies, ontologies, and knowledge graphs – that you need to leverage, access, or query as if they were actually one database. For example, the EU's ESCO, the US Government's O*Net, the World Economic Forum's taxonomy, Lightcast's (formerly EMSI's) taxonomy, and LinkedIn's proprietary knowledge graph all contain statements about job skills in different formats, based on different resources, created with different methods and assumptions for different purposes. They each have different levels of accuracy, coverage, and depth across different domains – and were developed with different degrees of technical expertise. They contain complementary facts about the same or related entities of interest – enough that it makes sense to consider merging them to create an integrated resource. It's clear that this integration will yield much better coverage of skills (and perhaps better depth of description), but it's not so clear how integration will affect the accuracy of the data.
The good news is that the facts in all of these kinds of resources can be converted to a common syntax: a row in a spreadsheet or database denotes some entity that shows up as the subject node in ontologies and knowledge graphs [see my Facts and Types of Facts cheat sheet]. So a row in a spreadsheet corresponds to the "neighborhood" of triples that share the same subject – i.e., an array of features. The column labels in the spreadsheet are the predicates or relations in ontologies or knowledge graphs. And the values at row-column coordinates are the object nodes in knowledge graphs. Each link in a taxonomy tree can also be formatted as a child-predicate-parent triple, as in other kinds of knowledge graphs. These formats are reliably interchangeable, but choosing one of them determines what your tech stack will look like and which operations you can and cannot perform on your knowledge store: just try to do multi-hop queries in Excel! For the moment, let's assume a subject-predicate-object triple format so that we can use unified terminology.
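As a minimal sketch (the column names and values here are invented), this is roughly how one spreadsheet row unrolls into a neighborhood of triples that share a subject:

```python
# A spreadsheet row: the row's key becomes the subject, column labels
# become predicates, and cell values become object nodes.
row = {"skill_id": "skill:1042",
       "label": "Java programming",
       "category": "software development",
       "source": "ESCO"}

subject = row["skill_id"]
triples = [(subject, predicate, value)
           for predicate, value in row.items()
           if predicate != "skill_id"]

for t in triples:
    print(t)
# ('skill:1042', 'label', 'Java programming')
# ('skill:1042', 'category', 'software development')
# ('skill:1042', 'source', 'ESCO')
```

The same unrolling works in reverse: grouping triples by subject reassembles the rows.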
The bad news is that once we convert all our resources to (for example) a triple format, the nodes and predicates don't necessarily have a common semantics: they were created by different teams making different assumptions about what a given predicate or node means. For example, in the resources cited above, each team defined "skill" differently, so they include or exclude different entities as skills, based on different criteria – which affects all the data in each resource.
Similarly, a feature, predicate, or tag like isaSmallBusiness will mean "less than 20 employees" in one knowledge resource and "less than 100 employees" in another. Also, triples with "Paris" as subject node might refer to a city in France, another in Texas, a company or subsidiary that happens to be located there, or a person – across different resources. "Sales this quarter" can denote anything from funds received to non-binding oral agreements to wishful projections. This problem of ambiguity is made catastrophically worse by the fact that most creators of knowledge resources neglect to curate and document (e.g., in some kind of data dictionary or catalog) the concepts that they use. The concepts might be clear to the developers, but they're not clear to anyone else.
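A toy illustration of the damage, with both thresholds invented for the example: the same tag, applied under different undocumented criteria, classifies the same company two different ways.

```python
# Two resources use the same tag under different, undocumented criteria.
resource_a = {"isaSmallBusiness": lambda employees: employees < 20}
resource_b = {"isaSmallBusiness": lambda employees: employees < 100}

employees = 50
print(resource_a["isaSmallBusiness"](employees))  # False
print(resource_b["isaSmallBusiness"](employees))  # True
# A naive merge would place the same company both inside and
# outside the "small business" class at the same time.
```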
The flip side of this problem is that nothing prevents different teams from using wildly different labels for the same entities in different resources, which means that string matching across resources is essentially useless. Every data scientist, at one time or another, has faced the problem of estimating the number of, say, software engineers, only to realize that the results are dramatically off because application developer, SWE, programmer, solutions architect, Java ninja, Entwickler, ingénieure logiciel, and literally thousands of other strings are used as labels for this same entity. If there are hundreds or thousands of synonymous strings and we query on only one or two or five strings, then our resulting analytics or query results are simply junk.
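A minimal sketch of why this wrecks analytics, with the records and label set invented here: a knowledge-graph node gathers all of its alternate labels, so a count matches every variant instead of whichever string the analyst happened to pick.

```python
# Hypothetical records: raw job-title strings from several resources.
records = ["SWE", "programmer", "Java ninja", "chef", "Entwickler",
           "application developer", "solutions architect", "nurse"]

# Naive string matching finds nothing.
naive_count = sum(1 for r in records if r == "software engineer")

# The knowledge-graph node for "software engineer" carries its synonyms.
alt_labels = {"software engineer", "SWE", "programmer", "Java ninja",
              "Entwickler", "application developer",
              "solutions architect", "ingénieure logiciel"}
graph_count = sum(1 for r in records if r in alt_labels)

print(naive_count, graph_count)  # 0 vs 6
```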
Knowledge graphs to the rescue
Enter the knowledge graph as a tool for integrating data across multiple resources. We noted above that many knowledge resources can be converted into a common format as knowledge graph triples. Choosing (or developing) a knowledge graph for data integration also means using it to provide an initial common, objective semantics as a pivot or point of reference for all the different knowledge resources that we want to integrate.
We speak of initial semantics because each new knowledge resource will contain new entities and new predicates that we can use to expand and enrich the original reference knowledge graph. We speak of common semantics because the semantics of the reference knowledge graph will be shared across all the resources that map to it, and will enable queries across all resources. And we speak of objective semantics because in knowledge graphs the definition of each node is made explicit in the form of triples with that node as subject, and this enables more systematic validation and more accurate mapping from multiple resources. Not only can we review, validate, and curate these definitions as needed, but the conceptual components of these definitions are in a format accessible to algorithms, not only to human reviewers.
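Because definitions are themselves triples, algorithms can compare them directly. Here is a toy sketch, with all entities and predicates invented, that scores a candidate mapping by the overlap of two nodes' definitional neighborhoods:

```python
# Definitional triples for a node in the reference graph and for a
# candidate node from an external resource (all names hypothetical).
ref_def = {("skill:java", "type", "programming_language_skill"),
           ("skill:java", "domain", "software_development"),
           ("skill:java", "requires", "object_oriented_concepts")}

cand_def = {("esco:java_prog", "type", "programming_language_skill"),
            ("esco:java_prog", "domain", "software_development")}

def definition_overlap(a, b):
    """Jaccard overlap of (predicate, object) pairs, ignoring subject IDs."""
    pa = {(p, o) for _, p, o in a}
    pb = {(p, o) for _, p, o in b}
    return len(pa & pb) / len(pa | pb)

print(round(definition_overlap(ref_def, cand_def), 2))  # 0.67 -> plausible match
```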
The key idea is to integrate the different resources by mapping them all to the same reference knowledge graph, using the semantics of that knowledge graph as the standard -- or what is called a semantic integration layer. This is a kind of soft standardization that maintains the entities and predicates of the original resources rather than laboriously converting them to the entities and predicates of the new standard (i.e., hard standardization). The mapping from each resource to the reference knowledge graph is soft in the sense that it is flexible and separate from the data itself, can be updated without affecting the original data, and might take the form of rules, algorithms, or human assignments. A key part of the mapping is establishing correspondences between the predicates or features in the resources and those in the reference knowledge graph. Doing this first makes the integration task more practical, because there are far fewer predicates than nodes in any resource, and we can then leverage the mapped predicates to estimate the similarity of nodes for validation.
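One way to picture the soft mapping, with all names hypothetical: the predicate correspondences are just a small table, kept apart from the data, that rewrites a resource's triples into the reference vocabulary on the fly and can be revised without touching the originals.

```python
# The mapping is data, stored separately from the resources themselves.
predicate_map = {
    "hasSkillLabel": "ref:label",    # hypothetical ESCO-style predicate
    "skill_name":    "ref:label",    # hypothetical spreadsheet column header
    "broader":       "ref:broader",  # hypothetical taxonomy link
}

source_triples = [
    ("esco:S1", "hasSkillLabel", "data analysis"),
    ("emsi:77", "skill_name", "data analysis"),
    ("esco:S1", "broader", "esco:S0"),
]

def to_reference(triples, pmap):
    """Rewrite predicates into the reference vocabulary where a
    correspondence exists; pass unmapped predicates through unchanged."""
    return [(s, pmap.get(p, p), o) for s, p, o in triples]

for t in to_reference(source_triples, predicate_map):
    print(t)
# Both resources' skill labels now answer the same ref:label query.
```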
This many-to-one approach of mapping to a reference is a more general version of building (one-to-one) crosswalks, for example between industry taxonomies such as SIC and NAICS (example here). Crosswalks map arbitrary pairs of resources, often without taking either one to be the standard. In cases where there are many resources, it is far simpler to map each one to a single reference than to map each to every other – a much more scalable approach.
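The scalability claim is simple arithmetic: n resources need n(n-1)/2 pairwise crosswalks but only n mappings to a single reference.

```python
from math import comb

# Pairwise crosswalks grow quadratically; reference mappings grow linearly.
for n in (3, 5, 10, 20):
    print(f"{n} resources: {comb(n, 2)} crosswalks vs {n} reference mappings")
# 10 resources: 45 crosswalks vs 10 reference mappings
# 20 resources: 190 crosswalks vs 20 reference mappings
```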
So which specific characteristics give knowledge graphs their superpowers as a target or reference for data integration -- more powerful than taxonomies, ontologies, or databases? Here are some of them:

- A common, simple syntax: rows, links, and tags from very different resources can all be expressed as subject-predicate-object triples.
- Explicit, machine-accessible semantics: each node's definition is itself a set of triples, so algorithms – not only human reviewers – can validate and compare definitions when mapping resources.
- Many labels per node: synonyms in any language attach to a single canonical entity, freeing queries and analytics from brittle string matching.
- Soft, updatable mappings: resources map to the reference graph without being converted or overwritten, and the mappings can be revised without touching the original data.
- Extensibility: each newly mapped resource contributes new entities and predicates that expand and enrich the reference graph itself.
Knowledge graphs have other superpowers, too. Which of them have you seen in practice?