Knowledge Graphs Add Layers of Value, part 1
Image by you.com


I see a growing consensus: more and more people are catching on to the value of knowledge graphs. I hear fewer and fewer questions this year about the basic concepts of knowledge graphs as fancy databases. Now I'm starting to field much more interesting, next-step questions: How do I start building a knowledge graph? What should I prioritize? How big does it have to be? Which tools will help? How do knowledge graphs add value?

Let me try to start answering these questions together, since each step of building a knowledge graph adds value in different ways.

Clearly, the first step is to identify which concepts the nodes of your graph will need to cover, and that's the focus of today's post. It's essential to remember that only a minuscule number of very large organizations will need to cover millions of concepts. Small knowledge graphs are both valuable and much more tractable as a place to start or to test out ROI.

First step: Identify what's important.

You want the nodes of your knowledge graphs in production to cover the most important concepts in a given domain – your domain. Not all the concepts in your domain. Not the whole universe. Business priorities are a crucial determinant of concept importance – but in my experience they're often not clear or specific enough to help. Unless, of course, you're building a knowledge graph for a specific business initiative, as I've done for nursing, industries, and green skills.

Another more generic but key indicator of the importance of concepts is the frequency of the terms used to express them. So I routinely start projects with a list of term frequencies. If you're building an integration layer over a collection of databases, the terms are column labels and maybe the values in key columns used for classification of entities. If you want to build more intelligent search, your terms will be user queries. If you want to cover a collection of product descriptions or manufacturing processes or customer support questions, the terms are the nouns, verbs, and adjectives that show up in the texts that describe them. Nouns are much more frequent, so they'll make up most of this initial list.

The good news: Counting term frequencies is a fully automatic process. Pro tip: especially if you focus on user-generated content, run a spell check on your initial list beforehand; it will save you a lot of time during the next steps.

The other good news: You'll get a number of terms that is 10 times smaller than the total number of words in your collection of texts. Even better: you can discard about 70% (or more!) of those terms because they're not frequent enough to pay attention to. [In product documentation, I routinely saw that 75% of terms occur only once or twice. Similarly, a collection of 5 million queries boiled down to about 50,000 terms.] So a 100,000-word collection of texts will yield only about 3,000 terms to start with.
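For illustration, here is a minimal Python sketch of this counting-and-filtering step. The regex tokenizer, toy stopword list, and frequency cutoff are simplifying assumptions for the example; a real pipeline would use a proper tokenizer and part-of-speech tagger to keep only nouns, verbs, and adjectives, and would run a spell-checking pass before review.

```python
from collections import Counter
import re

# Toy stopword list for the example; in practice, use a POS tagger to keep
# only nouns, verbs, and adjectives, and spell-check the list afterwards.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was"}

def term_frequencies(texts, min_count=2):
    """Count candidate terms across a text collection and drop the rare tail."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z][a-z-]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    # Discard terms that don't occur often enough to pay attention to.
    return {term: n for term, n in counts.items() if n >= min_count}

docs = [
    "The software engineer reviewed the software architecture.",
    "A software developer and a data architect joined the review.",
]
print(sorted(term_frequencies(docs).items(), key=lambda kv: -kv[1]))
# e.g. [('software', 3)]
```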

Next step: Find the synonyms.

Another key indicator of the importance of concepts is the number of different terms – synonyms, spellings, abbreviations, jargon, loanwords, etc. – used to express the same concept. At LinkedIn, for example, we saw literally hundreds of these "aliases" for the concept of software engineer – and that didn't include most typos or any of the translations! At eBay, we saw dozens, even hundreds, of typos for brand names – many of which were not caught by the spell checkers. All this variation means that your 3,000 initial terms will get whittled down to fewer than 1,000 concepts.

The idea behind this process is that you're thinking of a concept (at this point) as an equivalence class of terms. And the key criterion for deciding equivalence is having "the same meaning" – strictly the same, not just a related meaning.

Sure, this approach is subjective and fuzzy, but it works quite well for a first pass. At a later stage, you can tie down and document the components of these concepts more thoroughly and more explicitly in the knowledge graph.

Recall that terms are ambiguous, especially single-word terms. So "architect" will show up as a synonym for the different concepts of information architect, software architect, knowledge architect, submarine architect, etc. I make a point of documenting this ambiguity by adding each synonym to all of the concepts it is used for. This documentation step makes it much easier later on to identify errors and work on processes to disambiguate these synonyms.
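Here's a minimal sketch of how this can be tracked during the synonym pass, with hypothetical concepts and aliases. The ambiguous term "architect" is deliberately attached to every concept it can express, which documents the ambiguity for later disambiguation work.

```python
# Hypothetical concepts and aliases for illustration only.
concept_aliases = {
    "software engineer": ["software engineer", "software developer", "swe",
                          "sw engineer", "programmer"],
    "information architect": ["information architect", "ia", "architect"],
    "software architect": ["software architect", "architect"],
}

# Invert the mapping: each term points to every concept it may refer to.
term_to_concepts = {}
for concept, aliases in concept_aliases.items():
    for alias in aliases:
        term_to_concepts.setdefault(alias, set()).add(concept)

# Terms attached to more than one concept are the documented ambiguities.
ambiguous = {t: c for t, c in term_to_concepts.items() if len(c) > 1}
print(ambiguous)  # e.g. {'architect': {'information architect', 'software architect'}}
```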

Standard practice is to sort the list of terms by decreasing frequency and review them quickly by hand from top to bottom – from the most impactful terms to the less impactful ones. I frequently talk to people who are allergic even to the mention of manual work, especially work that they don't know how to do or to evaluate. But when you optimize the tasks, it is a very, very small price to pay for more quality control, fewer embarrassing outputs, fewer support requests, less brand erosion – and much less re-work. Besides, it's easy to outsource this kind of work to language service providers – so there's no excuse at all not to get it done.

Using LLMs or other tools is likely to speed up the review work significantly, but it's dangerous to rely blindly on tools (as so many people do), no matter how cool and trendy they are. Oh, and engineers aren't a good fit for this kind of work, either: they have no relevant training for it, and they tend to find it even more disheartening than documenting their code. I have seen time and time again over the years that analytic linguists really rock at these tasks. They do the work faster and with much higher quality (so there is far less re-work later on) because they have the most relevant training and a real sense of craftsmanship.

During this review step, the most frequent version of a term becomes the standard label (or "preferred term") for the concept – if it's not mangled or misspelled! – and any less frequent synonyms get added to the standard version's synonym list. The frequency of each synonym also gets added to the standard version's frequency so you can tally the total number of mentions of the concept across all synonyms (as an indicator of the concept's importance in your domain).

With this process, you end up creating a concept catalog – a table that has a single concept (a knowledge graph node) on each row, along with a standard label for the concept, a curated list of synonyms or "aliases" or "dispreferred terms", and the total number of mentions (frequency of the standard version plus frequencies of other versions). I always include a unique ID (like they do for Wikidata concepts) and an initial definition or description of each concept to prepare for next steps.
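A minimal sketch of one row of such a catalog might look like this in Python. The ConceptRow structure, the build_row helper, and the figures in the example are hypothetical illustrations, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptRow:
    """One row of the concept catalog: a node-to-be in the knowledge graph."""
    concept_id: str                      # stable unique ID, Wikidata-style
    preferred_label: str                 # most frequent well-formed variant
    aliases: list = field(default_factory=list)
    total_mentions: int = 0              # frequency summed across all variants
    description: str = ""                # initial definition, refined later

def build_row(concept_id, variant_freqs, description=""):
    """Pick the most frequent variant as the preferred label and sum mentions."""
    preferred = max(variant_freqs, key=variant_freqs.get)
    aliases = [v for v in variant_freqs if v != preferred]
    return ConceptRow(concept_id, preferred, aliases,
                      sum(variant_freqs.values()), description)

row = build_row("C0001",
                {"software engineer": 950, "software developer": 400, "swe": 120},
                "A person who designs and implements software systems.")
print(row.preferred_label, row.total_mentions)  # software engineer 1470
```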

The First Layer of Value

This kind of concept catalog doesn't sound sexy, but it is very useful and adds lots of value. On top of that, it is only the first layer of value that a knowledge graph will add. This first step simply focuses on identifying the conceptual nodes that the knowledge graph should include – draft requirements, as it were. Next steps create more value by enriching and relating these nodes further.

Note that along the way you reduced 100,000 tokens in documents to fewer than 1,000 concepts – a systematic, meaning-driven form of data compression. Mapping terms to concepts dramatically reduces the dimensionality and complexity of the problem space underlying your applications. Indexing and search in this 100-times-smaller space is much faster and requires far less memory, storage, and other hardware – while improving accuracy.

For search and recommendations, mapping a variety of synonyms in searched documents to the same concept – a kind of semantic indexing – increases liquidity; that is, it identifies more content as potentially relevant, so you will overlook fewer instances and make fewer omissions. This additional liquidity gives the ranking algorithm of a search engine more to work with and a more explicit notion of relevance, increasing search quality as well. Searching over a far smaller collection of concepts than distinct terms (with more data for each concept) also makes search more accurate by helping to dampen the variability among terms and term contexts that hinders mathematical modeling. This mapping is also the basic mechanism by which knowledge graphs help RAG-based systems feed more compact, more relevant information to LLMs.
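To make the mechanism concrete, here is a toy sketch of semantic indexing: surface terms are replaced by concept IDs at index time, so a query using one alias retrieves documents that use a different alias. The alias table and the naive substring matching are simplifications for the example.

```python
# Toy alias-to-concept table; a real one comes from the concept catalog.
alias_to_concept = {
    "software engineer": "C0001",
    "software developer": "C0001",
    "swe": "C0001",
}

def index_document(doc_id, text, inverted_index):
    """Add a document to a concept-level inverted index."""
    text = text.lower()
    for alias, concept_id in alias_to_concept.items():
        if alias in text:                      # naive matching, for brevity
            inverted_index.setdefault(concept_id, set()).add(doc_id)

def search(query, inverted_index):
    """Look up the query's concept and return all documents that mention it."""
    concept_id = alias_to_concept.get(query.lower())
    return inverted_index.get(concept_id, set())

index = {}
index_document("doc1", "We are hiring a Software Developer in Berlin.", index)
index_document("doc2", "Senior SWE wanted, remote.", index)
print(search("software engineer", index))  # {'doc1', 'doc2'}
```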

For machine learning, the rows in the concept catalog define cleaner, more reliable, semantically aggregated features. Better features help manage noise and increase model accuracy.
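As a small illustration, a concept-level feature simply aggregates the counts of all aliases into one column per concept; the alias table below is a toy example, not real catalog data.

```python
from collections import Counter

def concept_features(text, alias_to_concept):
    """Count concept mentions instead of raw surface terms (naive matching)."""
    counts = Counter()
    lowered = text.lower()
    for alias, concept_id in alias_to_concept.items():
        counts[concept_id] += lowered.count(alias)
    return dict(counts)

print(concept_features("SWE or software developer, either title works.",
                       {"swe": "C0001", "software developer": "C0001"}))
# {'C0001': 2}: one aggregated feature instead of two sparse term columns
```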

For analytics, mapping a variety of synonyms in the column labels of different data sets to the same concept improves linkage across tables for better data integration. This "semantic layer" gives data scientists more data and more relations between data to work with – regardless of which silo the data are housed in – yielding analytics with more representative coverage and a stronger empirical basis. This approach is far more robust and reliable than term spotting with a list of keywords.
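A minimal sketch of that linkage, with hypothetical column names and concept IDs: once column labels from different silos map to the same concept, the shared concepts tell you where tables can be joined or compared.

```python
# Hypothetical column labels mapped to concepts from the catalog.
column_to_concept = {
    "cust_id": "C2001", "customer_number": "C2001",   # same concept: customer ID
    "rev": "C2002", "net_revenue": "C2002",           # same concept: revenue
}

crm_columns = ["cust_id", "region", "rev"]
billing_columns = ["customer_number", "net_revenue", "invoice_date"]

def shared_concepts(cols_a, cols_b, mapping):
    """Return the concepts both tables expose, i.e. candidate join/compare keys."""
    a = {mapping[c] for c in cols_a if c in mapping}
    b = {mapping[c] for c in cols_b if c in mapping}
    return a & b

print(shared_concepts(crm_columns, billing_columns, column_to_concept))
# {'C2001', 'C2002'}: the tables link on customer ID and compare on revenue
```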

For multimedia asset management, this kind of concept catalog creates standardized, reusable tags that enable a uniform approach to annotating video, audio, written, and computer-generated assets. This in turn makes cross-modal and cross-lingual search and recommendations more accurate and more scalable. Using standardized concepts as tags also has important applications in analyzing customer feedback, product focus group data, business intelligence, and competitors' marketing strategies. This is also becoming the default method for annotating biomedical and financial data, using standardized concepts from ontologies (one kind of knowledge graph).

In sum, even a little bit of simply structured knowledge – along the lines sketched here – adds a layer of significant value.

And this simple knowledge sets the stage for building more layers that enrich and multiply the value that you get from knowledge that has more and different kinds of structure.


Pietro La Torre

Data Strategy | Data Governance | Data Engineering | I write about data

2 months ago

Very interesting and clear article! You don't need a lot of effort to start, and beginning small is always a good idea because you can improve and expand gradually. Moving from theory to practice is straightforward.

Daniel Lundin

Head of Operations at Ortelius, Transforming Data Complexity into Strategic Insights

5 months ago

My question might be a bit premature as I haven't read parts 2 and 3 yet. What's not mentioned yet is the ontological connections between the concepts; my experience tells me that you should do this in parallel while you are defining the core concepts. What's your approach to this?

Maxime Blanchard

Data Engineer (Apprenticeship)

5 months ago

Hello Mike, thank you very much for this three-part article, very clear and pragmatic. I tried to put the first part of your methodology into practice by collecting a corpus of texts and applying NLP techniques to extract its most important concepts. Is it possible to add and contact me on LinkedIn so that I can send you a short video of what I have done, and you can tell me if I'm on the right track? Best,

Hank Ratzesberger

DevOps Engineer @ i/o Werx

6 months ago

What happens when a term is very commonly over-used? I have a git "repository," and I push compiled artifacts to a "repository," and in the process also pull artifacts from same and push to a test results "repository." The most common term is now only a synonym.
