Superpowers of Knowledge Graphs, part 1: Data Integration
There is plenty of writing about the use cases of knowledge graphs across different domains, but I don't see many explanations of why knowledge graphs actually work in these cases.
The answer is simple: knowledge graphs have superpowers above and beyond those of ontologies, taxonomies, and mere databases. Knowledge graphs can't leap tall buildings in a single bound, but they do have something akin to x-ray vision and other powers.
One key superpower of knowledge graphs is data integration. (See other superpowers like data aggregation and taming LLMs.)
Imagine that you have a collection of structured knowledge resources – databases, spreadsheets, taxonomies, ontologies, and knowledge graphs – that you need to leverage, access, or query as if they were actually one database. For example, the EU's ESCO, the US Government's O*Net, the World Economic Forum's taxonomy, Lightcast's (formerly EMSI's) taxonomy, and LinkedIn's proprietary knowledge graph all contain statements about job skills in different formats, based on different resources, created with different methods and assumptions for different purposes. They each have different levels of accuracy, coverage, and depth across different domains – and were developed with different degrees of technical expertise. They contain complementary facts about the same or related entities of interest – enough that it makes sense to consider merging them to create an integrated resource. It's clear that this integration will yield much better coverage of skills (and perhaps better depth of description), but it's not so clear how integration will affect the accuracy of the data.
The good news is that the facts in all of these kinds of resources can be converted to a common syntax: a row in a spreadsheet or database denotes some entity that shows up as the subject node in ontologies and knowledge graphs [see my Facts and Types of Facts cheat sheet]. So a row in a spreadsheet corresponds to the "neighborhood" of triples that share the same subject – i.e., an array of features. The column labels in the spreadsheet are the predicates or relations in ontologies or knowledge graphs. And the values at row-column coordinates are the object nodes in knowledge graphs. Each link in a taxonomy tree can also be formatted as a child-predicate-parent triple, as in other kinds of knowledge graphs. These formats are reliably interchangeable, but choosing one of them determines what your tech stack will look like and which operations you can and cannot perform on your knowledge store: just try to do multi-hop queries in Excel! For the moment, let's assume a subject-predicate-object triple format so that we can use unified terminology.
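As a minimal sketch (the column names and values here are invented), this is roughly how one spreadsheet row unrolls into a neighborhood of triples that share a subject:

```python
# A spreadsheet row: the row's key becomes the subject, column labels
# become predicates, and cell values become object nodes.
row = {"skill_id": "skill:1042",
       "label": "Java programming",
       "category": "software development",
       "source": "ESCO"}

subject = row["skill_id"]
triples = [(subject, predicate, value)
           for predicate, value in row.items()
           if predicate != "skill_id"]

for t in triples:
    print(t)
# ('skill:1042', 'label', 'Java programming')
# ('skill:1042', 'category', 'software development')
# ('skill:1042', 'source', 'ESCO')
```

The same unrolling works in reverse: grouping triples by subject reassembles the rows.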
The bad news is that once we convert all our resources to (for example) a triple format, the nodes and predicates don't necessarily have a common semantics: they were created by different teams making different assumptions about what a given predicate or node means. For example, in the resources cited above, each team defined "skill" differently, so they include or exclude different entities as skills, based on different criteria – which affects all the data in each resource.
Similarly, a feature, predicate, or tag like isaSmallBusiness will mean "less than 20 employees" in one knowledge resource and "less than 100 employees" in another. Also, triples with "Paris" as subject node might refer to a city in France, another in Texas, a company or subsidiary that happens to be located there, or a person – across different resources. "Sales this quarter" can denote anything from funds received to non-binding oral agreements to wishful projections. This problem of ambiguity is made catastrophically worse by the fact that most creators of knowledge resources neglect to curate and document (e.g., in some kind of data dictionary or catalog) the concepts that they use. The concepts might be clear to the developers, but they're not clear to anyone else.
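A toy illustration of the damage, with both thresholds invented for the example: the same tag, applied under different undocumented criteria, classifies the same company two different ways.

```python
# Two resources use the same tag under different, undocumented criteria.
resource_a = {"isaSmallBusiness": lambda employees: employees < 20}
resource_b = {"isaSmallBusiness": lambda employees: employees < 100}

employees = 50
print(resource_a["isaSmallBusiness"](employees))  # False
print(resource_b["isaSmallBusiness"](employees))  # True
# A naive merge would place the same company both inside and
# outside the "small business" class at the same time.
```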
The flip side of this problem is that nothing prevents different teams from using wildly different labels for the same entities in different resources, which means that string matching across resources is essentially useless. Every data scientist, at one time or another, has faced the problem of estimating the number of, say, software engineers, only to realize that the results are dramatically off because application developer, SWE, programmer, solutions architect, Java ninja, Entwickler, ingénieure logiciel, and literally thousands of other strings are used as labels for this same entity. If there are hundreds or thousands of synonymous strings and we query on only one or two or five strings, then our resulting analytics or query results are simply junk.
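A minimal sketch of why this wrecks analytics, with the records and label set invented here: a knowledge-graph node gathers all of its alternate labels, so a count matches every variant instead of whichever string the analyst happened to pick.

```python
# Hypothetical records: raw job-title strings from several resources.
records = ["SWE", "programmer", "Java ninja", "chef", "Entwickler",
           "application developer", "solutions architect", "nurse"]

# Naive string matching finds nothing.
naive_count = sum(1 for r in records if r == "software engineer")

# The knowledge-graph node for "software engineer" carries its synonyms.
alt_labels = {"software engineer", "SWE", "programmer", "Java ninja",
              "Entwickler", "application developer",
              "solutions architect", "ingénieure logiciel"}
graph_count = sum(1 for r in records if r in alt_labels)

print(naive_count, graph_count)  # 0 vs 6
```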
Knowledge graphs to the rescue
Enter the knowledge graph as a tool for integrating data across multiple resources. We noted above that many knowledge resources can be converted into a common format as knowledge graph triples. Choosing (or developing) a knowledge graph for data integration also means using it to provide an initial common, objective semantics as a pivot or point of reference for all the different knowledge resources that we want to integrate.
We speak of initial semantics because each new knowledge resource will contain new entities and new predicates that we can use to expand and enrich the original reference knowledge graph. We speak of common semantics because the semantics of the reference knowledge graph will be shared across all the resources that map to it, and will enable queries across all resources. And we speak of objective semantics because in knowledge graphs the definition of each node is made explicit in the form of triples with that node as subject, and this enables more systematic validation and more accurate mapping from multiple resources. Not only can we review, validate, and curate these definitions as needed, but the conceptual components of these definitions are in a format accessible to algorithms, not only to human reviewers.
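Because definitions are themselves triples, algorithms can compare them directly. Here is a toy sketch, with all entities and predicates invented, that scores a candidate mapping by the overlap of two nodes' definitional neighborhoods:

```python
# Definitional triples for a node in the reference graph and for a
# candidate node from an external resource (all names hypothetical).
ref_def = {("skill:java", "type", "programming_language_skill"),
           ("skill:java", "domain", "software_development"),
           ("skill:java", "requires", "object_oriented_concepts")}

cand_def = {("esco:java_prog", "type", "programming_language_skill"),
            ("esco:java_prog", "domain", "software_development")}

def definition_overlap(a, b):
    """Jaccard overlap of (predicate, object) pairs, ignoring subject IDs."""
    pa = {(p, o) for _, p, o in a}
    pb = {(p, o) for _, p, o in b}
    return len(pa & pb) / len(pa | pb)

print(round(definition_overlap(ref_def, cand_def), 2))  # 0.67 -> plausible match
```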
The key idea is to integrate the different resources by mapping them all to the same reference knowledge graph, using the semantics of that knowledge graph as the standard -- or what is called a semantic integration layer. This is a kind of soft standardization that maintains the entities and predicates of the original resources rather than laboriously converting them to the entities and predicates of the new standard (i.e., hard standardization). The mapping from each resource to the reference knowledge graph is soft in the sense that it is flexible and separate from the data itself, can be updated without affecting the original data, and might take the form of rules, algorithms, or human assignments. A key part of the mapping is establishing correspondences between the predicates or features in the resources and those in the reference knowledge graph. Doing this first makes the integration task more practical, because there are far fewer predicates than nodes in any resource, and we can then leverage the mapped predicates to estimate the similarity of nodes for validation.
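One way to picture the soft mapping, with all names hypothetical: the predicate correspondences are just a small table, kept apart from the data, that rewrites a resource's triples into the reference vocabulary on the fly and can be revised without touching the originals.

```python
# The mapping is data, stored separately from the resources themselves.
predicate_map = {
    "hasSkillLabel": "ref:label",    # hypothetical ESCO-style predicate
    "skill_name":    "ref:label",    # hypothetical spreadsheet column header
    "broader":       "ref:broader",  # hypothetical taxonomy link
}

source_triples = [
    ("esco:S1", "hasSkillLabel", "data analysis"),
    ("emsi:77", "skill_name", "data analysis"),
    ("esco:S1", "broader", "esco:S0"),
]

def to_reference(triples, pmap):
    """Rewrite predicates into the reference vocabulary where a
    correspondence exists; pass unmapped predicates through unchanged."""
    return [(s, pmap.get(p, p), o) for s, p, o in triples]

for t in to_reference(source_triples, predicate_map):
    print(t)
# Both resources' skill labels now answer the same ref:label query.
```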
This many-to-one approach of mapping to a reference is a more general version of building (one-to-one) crosswalks, for example between industry taxonomies such as SIC and NAICS (example here). Crosswalks map arbitrary pairs of resources, often without taking either one to be the standard. In cases where there are many resources, it is far simpler to map each one to a single reference than to map each to every other – a much more scalable approach.
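The scalability claim is simple arithmetic: n resources need n(n-1)/2 pairwise crosswalks but only n mappings to a single reference.

```python
from math import comb

# Pairwise crosswalks grow quadratically; reference mappings grow linearly.
for n in (3, 5, 10, 20):
    print(f"{n} resources: {comb(n, 2)} crosswalks vs {n} reference mappings")
# 10 resources: 45 crosswalks vs 10 reference mappings
# 20 resources: 190 crosswalks vs 20 reference mappings
```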
So which specific characteristics give knowledge graphs their superpowers as a target or reference for data integration -- more powerful than taxonomies, ontologies, or databases? Here are some of them:

- A common, simple syntax: rows, links, and tags from very different resources can all be expressed as subject-predicate-object triples.
- Explicit, machine-accessible semantics: each node's definition is itself a set of triples, so algorithms – not only human reviewers – can validate and compare definitions when mapping resources.
- Many labels per node: synonyms in any language attach to a single canonical entity, freeing queries and analytics from brittle string matching.
- Soft, updatable mappings: resources map to the reference graph without being converted or overwritten, and the mappings can be revised without touching the original data.
- Extensibility: each newly mapped resource contributes new entities and predicates that expand and enrich the reference graph itself.
Knowledge graphs have other superpowers, too. Which of them have you seen in practice?