Knowledge Graphs and Layers of Value, part 2
My last post described the first steps in – and the first layer of value from – building a knowledge graph. These initial steps identified the key concepts in a domain and the various terms that express each, producing what I call a concept catalog. More technically, this is a simple method of entity resolution.
The concept catalog describes the most important nodes of the knowledge graph that we are building. These nodes represent the individual and category entities in your domain. To complete a graph, next we need to identify the most important edges or relations between them.
Next step: Find the relations.
You want the relations of the knowledge graph that you will use in production to cover the most important relations for your use case. Not any possible relation but the ones that consistently structure the nodes in your domain. There are a couple of useful sources for relations that you can target.
There are generic, domain-independent relations that crop up everywhere like those used to build taxonomies (subcategoryOf and instanceOf) as well as additional general purpose relations like partOf, usedFor and others, which are used to build ontologies. Taxonomies and ontologies, then, are simply components of the whole knowledge graph that you are building.
There are also domain- or project-specific relations that are unique to your specific applications. But which ones are important? As with the nodes, business priorities and frequency of the terms that express key relations are very important guides for finding the most important relations in a domain. So I routinely work up a list of the most frequent adjectives, verbs, and units of measure as a starting point for identifying them.
If you're building an integration layer over a collection of databases, the terms are again column labels and maybe the values in key columns that are used for classification – but this time (since we set aside the nouns) the adjectives, measures, or features of key entities will become more visible. If you want to build more intelligent or faceted search, your terms will again be user queries – now focusing on adjectives and measures. If you want to cover a collection of product descriptions, manufacturing processes, or customer support questions, the terms are the adjectives, verbs, and measures that show up in the texts that describe them. Nouns for the most part indicate entities, so we need to remove them from our frequency list to find the adjectives and measures that indicate characteristics of these entities.
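As a minimal sketch of that frequency list, assuming a handful of made-up product descriptions and a hypothetical set of nouns already captured in the concept catalog, the remaining terms can be counted like this (TEXTS, CATALOG_NOUNS, and candidate_relation_terms are illustrative names, not part of any library):

```python
from collections import Counter
import re

# Hypothetical inputs: a few product descriptions and the entity nouns
# already recorded in the concept catalog from the first phase.
TEXTS = [
    "Red cotton shirt, 30 inches long",
    "Blue cotton shirt, 28 inches long",
    "Long red scarf, 2 meters",
]
CATALOG_NOUNS = {"shirt", "scarf"}

def candidate_relation_terms(texts, known_nouns):
    """Count alphabetic terms that are not already catalogued as entity nouns."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in known_nouns:
                counts[token] += 1
    return counts

counts = candidate_relation_terms(TEXTS, CATALOG_NOUNS)
for term, n in counts.most_common():
    print(term, n)
```

Filtering by the catalog is a stand-in for real part-of-speech tagging; in practice you would probably still review the resulting list by hand.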
A very important step at this point is to identify aggregates of related adjectives or measures and use their category label as the name of the relation. For example, we might find red, blue, teal, green, yellow as common adjectives. We can include in our graph a relation like hasColor that can have those colors as values. Similarly, we might find in a different domain measures like feet, meters, inches, yards, millimeters – or subjective measures like long and short – and add a relation like hasLength. By a similar process, we use well-known, generic relations like subcategoryOf and usedFor to aggregate the noun entities that we identified before. All of this aggregation enables an additional kind of systematic, meaning-driven data compression, complementing the synonymy-based compression that we realized while creating the initial nodes of the knowledge graph.
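One way to picture this aggregation, as a rough sketch only: group the frequent terms under a relation name, then emit a triple whenever one of those terms describes an entity. The groupings reuse the examples above; the entity id shirt_1 and the helper to_triples are hypothetical.

```python
# Each relation name is the category label for a group of related
# adjectives or measures taken from the frequency list.
RELATION_VALUES = {
    "hasColor": {"red", "blue", "teal", "green", "yellow"},
    "hasLength": {"feet", "meters", "inches", "yards", "millimeters", "long", "short"},
}

# Invert the grouping so each term points at its relation.
TERM_TO_RELATION = {
    term: relation
    for relation, terms in RELATION_VALUES.items()
    for term in terms
}

def to_triples(entity, terms):
    """Build (entity, relation, value) triples for terms with a known relation."""
    triples = []
    for term in terms:
        relation = TERM_TO_RELATION.get(term)
        if relation is not None:
            triples.append((entity, relation, term))
    return triples

triples = to_triples("shirt_1", ["red", "long", "cotton"])
print(triples)
# [('shirt_1', 'hasColor', 'red'), ('shirt_1', 'hasLength', 'long')]
```

Terms without a relation yet (here, cotton) are simply skipped, which makes them easy to collect as candidates for the next relation to add.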
Verbs are a special case and are often not treated as systematically – because many of them don't easily fit into the super-simple triple format that is standard usage for knowledge graphs. Key verbs will indicate very important relations in a knowledge graph. Most often sentences with verbs get forced into triple format – the square-peg-in-round-hole method – using arbitrary and incomplete labels. The more systematic (but less common) approach is to implement hypergraphs that allow more than two entities for a single relation – so the graph will represent verbal constructions more precisely.
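To make the contrast concrete, here is a small sketch of a hyperedge that keeps every participant of a verb in one relation instead of splitting it across triples. The sentence ("Acme ships widgets to Berlin by rail"), the role names, and the HyperEdge class are all assumptions made for illustration; actual hypergraph stores model this in their own ways.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HyperEdge:
    """A single relation connecting any number of entities, each in a named role."""
    relation: str
    roles: tuple  # sorted tuple of (role, entity) pairs

    def participants(self):
        return {entity for _, entity in self.roles}

def make_edge(relation, **roles):
    """Create a hyperedge from keyword arguments such as agent='Acme'."""
    return HyperEdge(relation, tuple(sorted(roles.items())))

def edges_involving(edges, entity):
    """Return every hyperedge in which the entity plays some role."""
    return [edge for edge in edges if entity in edge.participants()]

# "Acme ships widgets to Berlin by rail" as one four-place relation.
edge = make_edge("ships", agent="Acme", item="widgets",
                 destination="Berlin", mode="rail")
print(edge.roles)
print(edges_involving([edge], "Berlin"))
```

A plain triple such as (Acme, ships, widgets) would have to drop the destination and the mode, or push them into extra triples attached to an artificial node.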
The Second Layer of Value
The first phase of knowledge graph building (in my previous post) identifies the most important entities and the many terms used to identify each of them. When done carefully, this implements a single semantic relation: synonymOf. This yields dramatic data compression and adds value in different ways.
The second phase described here identifies the most important relations – mostly focused on attributes of entities – and aggregates the terms used to define them. This kind of data compression based on adjectives and measures implements a different semantic relation: instanceOf but focuses on aggregating attributes rather than entities.
For search and recommendations, mapping a variety of terms to a single category concept – for example, colors – is another kind of semantic indexing. It, too, increases liquidity but also drastically simplifies the feature space used to characterize the entities of interest. Searching for a much smaller collection of entities in a much smaller feature space makes search and recommendations faster and more scalable – as well as more accurate.
For machine learning, the relations define a feature space with more interpretable, less variable, semantically aggregated features – rather than ad hoc strings.
For analytics, the relations aggregate over column labels in different data sources to create a map or semantic integration layer that helps overcome the many difficulties of siloed data developed by different teams. This is a key step in moving toward create-once-use-often Findable, Accessible, Interoperable, Reusable (FAIR) data that can reliably be shared across teams and across organizations.
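A toy version of such an integration map might look like the following. The source names (crm, erp), the column labels, and the integrate helper are invented for this sketch.

```python
# Hypothetical column labels from two siloed sources, each mapped to a
# shared relation name from the knowledge graph.
COLUMN_TO_RELATION = {
    ("crm", "colour"): "hasColor",
    ("erp", "item_color"): "hasColor",
    ("crm", "len_in"): "hasLength",
    ("erp", "length_mm"): "hasLength",
}

def integrate(source, row):
    """Rename a row's columns to shared relation names; keep unmapped columns as-is."""
    return {
        COLUMN_TO_RELATION.get((source, column), column): value
        for column, value in row.items()
    }

print(integrate("crm", {"colour": "red", "len_in": 30}))
# {'hasColor': 'red', 'hasLength': 30}
print(integrate("erp", {"item_color": "blue", "sku": "A1"}))
# {'hasColor': 'blue', 'sku': 'A1'}
```

This only aligns the labels. The values still carry their original units (inches in one source, millimeters in the other), so a real integration layer would also need to record or convert units.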
For asset management, these relations expand the collection of standardized, reusable tags for annotating video, audio, written, and computer-generated assets. As we create more semantically defined tags of different types, cross-modal and cross-lingual search become more accurate and analysis of content from wide-ranging sources becomes more precise.
In sum, even making a little bit of additional structured knowledge explicit – along the lines sketched here – adds yet another layer of significant value.
One way to think of knowledge graph construction is adding layer upon layer of meaning (and value) where each layer corresponds to one or more different semantic dimensions. First, we built a layer based on the semantic relation of synonymOf, then we built a new layer based on instanceOf (and perhaps others). Each time we add a new semantic dimension, we multiply the ways in which we can relate or cross-index the entities in our domain, adding simultaneous value to a wide range of use cases.
But all of this is still based on a very simplistic view of a "concept" as simply an equivalence class of terms with no internal structure. Sure, we can structure these simplistic nodes and relations as a graph – a text graph – but it's naïve to think that this view might enable autonomous artificial intelligence – because it always requires humans to assess when and whether terms have equivalent meanings. We don't have the mechanisms yet for an AI to assess whether different terms and concepts refer to the same components and characteristics.
So we need additional steps to build a more knowledgeable knowledge graph – and add more layers of value – the topic of my next post.
Head of Operations at Ortelius, Transforming Data Complexity into Strategic Insights
5 months ago
Ontologies have entered the playing field, great! Hypergraphs (or perhaps rather metagraphs) should be a logical next step for those areas where the knowledge graph becomes too limited in its usefulness (multidimensional representation). One area where knowledge graphs are limited is the definition and implementation of attributes. hasLength is a dangerous relation, as Length is often defined (or left undefined, since the definition defaults to the context of the system and its use) in many different ways across many different systems with different UOMs. How do you suggest managing these across systems and (organizational) functions?
Data-Centric AI/ML, Semantic Engineering, Process Automation, Enterprise Data Transformation, Holographic Marketing
6 months ago
Alan Rodriguez Layers of value are a fundamental component of establishing data and data products as property, and keeping track of them can transform our relationship to (and investment in) information and knowledge.
Author of 'Enterprise Architecture Fundamentals', Founder & Owner of Caminao
6 months ago
Knowledge graphs are like bottles and levels say nothing about contents, water or wine. https://caminao.blog/overview/knowledge-kaleidoscope/ea-symbolic-twins/
Holistic Management Analysis and Knowledge Representation (Ontology, Taxonomy, Knowledge Graph, Thesaurus/Translator) for Enterprise Architecture, Business Architecture, Zero Trust, Supply Chain, and ML/AI foundation.
6 months ago
Thanks for posting this and the previous part, which reinforce my prior methods, as we've discussed. As commented before, my 4-decade approach to this is explained at https://www.dhirubhai.net/feed/update/urn:li:activity:7197143807740526594?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7197143807740526594%2C7197217489435467778%29&replyUrn=urn%3Ali%3Acomment%3A%28activity%3A7197143807740526594%2C7197255463774695425%29&dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287197217489435467778%2Curn%3Ali%3Aactivity%3A7197143807740526594%29&dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287197255463774695425%2Curn%3Ali%3Aactivity%3A7197143807740526594%29
Balaji D Loganathan Surendran Sukumaran