Entity Resolution: Priority #1 for Building Real Knowledge Graphs
I keep seeing mentions of "entity-resolved knowledge graphs", which leads me to believe that other so-called "knowledge" graphs don't resolve their entities. But a knowledge graph without entity resolution is like a beach under deep snow and dark skies. Yeah, you might optimistically want to call it a beach, but it's hardly usable as such.
Entity resolution is not a "nice to have" – it's essential for a reliable knowledge graph worthy of the name. That's why I call the graphs without entity resolution "counterfeit" knowledge graphs – they're simply sentences formatted as graphs with raw strings as nodes. The counterfeit ones are easy to build, but hard to make useful. Let's see why.
What is Entity Resolution?
Entity resolution is a key aspect of building the nodes in a graph from text – and of determining what those nodes contain. Nodes that only contain strings or labels are so impoverished that they force us to rely on semantic information from the user – they don't capture knowledge, so it's not directly accessible for manipulation by algorithms. These sketchy, uninformative nodes characterize what I call text graphs, since we haven't resolved the nodes to unique, richer concepts. Nodes that are equivalence classes of strings – collections of synonyms – usually also include a richer range of attributes as a machine-accessible definition, so they add a significant layer of value to generic graphs and start to make them knowledge graphs.
When we start from text data, entity resolution is a more general version of another well-known problem: pronoun resolution. It's hard to know which entity he (or another pronoun) actually means – it's a "wildcard" word – and there is both research and tooling to help algorithms make good, language-specific guesses. For entity resolution, instead of worrying only about pronouns, we also focus on names and noun phrases like those below – and try to figure out which entity they're talking about. We have to get the graph-building algorithm to decide which entity node in the graph a string maps to in each sentence context.
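As a sketch of that decision, consider a tiny resolver that maps a surface string to a set of candidate entity nodes, then uses word overlap with the sentence to pick one. Everything here is an illustrative assumption, not a real system: the `ALIASES` and `CONTEXT_HINTS` tables are made up, and the Wikidata-style IDs are used only as node labels.

```python
# A minimal sketch of mention-to-entity resolution, NOT a real system:
# the alias table and context hints are tiny, made-up samples.

ALIASES = {                        # surface string -> candidate entity nodes
    "barry o": {"Q76"},
    "obama": {"Q76", "Q13133"},    # Barack or Michelle Obama
    "the boss": {"Q76", "Q1225"},  # Obama's nickname, or Bruce Springsteen
}

CONTEXT_HINTS = {                  # entity node -> words that suggest it
    "Q76": {"president", "barack", "white", "house"},
    "Q13133": {"michelle", "first", "lady"},
    "Q1225": {"springsteen", "song", "album"},
}

def resolve(mention, sentence):
    """Map one mention string to the entity node its sentence context favors."""
    candidates = ALIASES.get(mention.lower())
    if not candidates:
        return None                # unknown string: no node to resolve to
    words = set(sentence.lower().split())
    # Score each candidate by overlap between its hint words and the sentence.
    return max(candidates, key=lambda node: len(CONTEXT_HINTS[node] & words))

print(resolve("the Boss", "The Boss played a song from his new album"))
# prints Q1225: the musical context points at the Springsteen node
```

The same string resolves differently in a different context: `resolve("the Boss", "the president spoke")` lands on Q76 instead. That context-sensitivity is exactly what raw-string nodes can't express.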
Both kinds of resolution are specialized cases of word sense disambiguation – trying to establish what a given instance of a string actually means, i.e., using evidence to "resolve" or assign it to one graph node or another. So entity resolution helps to solve a very, very serious problem for algorithms: humans have this weird habit of talking about the same exact entity in a gazillion different ways. How's an algo to know who or what they're talking about without some extra guidance?
People might say "Barry O", "President Obama", "my husband" (said by Michelle), "Dad" (for his kids), "Mr President" (in 2010), "the Boss", "Renegade" (for his Secret Service staff), "the 44th President", "the Idiot in Chief", "Odumbo", "Nobama", "me" (when he says it himself), "he" or "him" or "his", "Barry O'Bomber", "オバマ", "Обама", "Q76" (for Wikidata), "the guy sitting next to Michelle", "Malia's Dad", "Barack Hussein Obama II", (and many other sequences of characters) to refer to the same exact entity: Barack Obama!
So how much really is a gazillion? Well, when we include abbreviations, acronyms, initialisms, typos, synonyms, slang, transliterations, formal titles, pronouns, accents, alternative character encodings, mispronunciations, and nicknames for just one language, it's common to have many hundreds of distinct sequences of characters or sounds for one lonely entity. Multiply that by just 30 languages, and we're now faced with many thousands of ways of referring to the same single entity! Multiply that amount by the number of entities that we might want to find or mention in a complex domain and the problem quickly becomes one of processing millions of strings. And that's just to identify mentions of entities. We still haven't talked about identifying mentions of attributes, measures, other literal values, or relations yet. All that seems like a lot to me – enough to call it "a gazillion".
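To put rough numbers on "a gazillion", here's the arithmetic in a few lines, along with the inversion of a variant list into the lookup table a resolver actually needs. The per-language form count, language count, entity count, and the `VARIANTS` sample are all illustrative assumptions, not measurements; Q76 is the Wikidata ID mentioned above.

```python
# Back-of-the-envelope: how big is "a gazillion"? All numbers below are
# illustrative assumptions, not measurements.
forms_per_language = 300   # variants, typos, nicknames, titles, ...
languages = 30
entities = 1_000           # entities of interest in one complex domain
total_strings = forms_per_language * languages * entities
print(total_strings)       # nine million strings, just for entity mentions

# A tiny made-up sample of variants for one entity (Q76 = Barack Obama):
VARIANTS = {
    "Q76": ["Barack Obama", "President Obama", "Barry O", "Nobama",
            "the 44th President", "オバマ", "Обама"],
}
# Invert it into the direction a resolver needs: surface string -> node.
LOOKUP = {form.lower(): node for node, forms in VARIANTS.items()
          for form in forms}
assert LOOKUP["nobama"] == LOOKUP["barry o"] == "Q76"
```

Many strings, one node: that inverted table is the heart of the cheap, pre-processing-time version of entity resolution the next section argues for.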
Why bother?
What really happens if you don't resolve your entity mentions? In a nutshell, if you don't, then you're doing naïve keyword spotting and all of your frequencies from text data are simply wrong, whether it's for analytics, machine learning models, or the transformers in your LLM. So there are very real consequences of taking this very common shortcut:
If you don't resolve (or at least reduce) your entity mentions during pre-processing, then you need much more expensive (and still unreliable) computations later on to extract entity data. You will have to rely on some after-the-fact measure of similarity (like cosine distance) between the usual contexts of Obama and those of Nobama to make an unreliable guess about whether they're "similar" entities. But because those who used Nobama were critics of his policies, they will say very different things than the supporters who used Obama – and algorithms will conclude that they are different entities. Oops. Context isn't really meaning, it's context – that's why we have different words.
If you're working with text data and don't do reliable entity resolution, then your analytics and models are simply trash – all the more dangerous when a lightweight evaluation suggests that they look fine.
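The frequency claim is easy to see in a few lines. In this sketch the five-mention corpus and the `ALIAS` table are made up for illustration; the point is only that naive keyword counts fragment one entity's frequency across its surface forms, while resolved counts don't.

```python
from collections import Counter

# A made-up five-mention corpus; every string refers to the same person.
mentions = ["Obama", "President Obama", "Barry O", "Nobama", "Obama"]

# Naive keyword spotting: the entity's frequency fragments across
# surface forms, so no single count is right.
raw = Counter(mentions)
assert raw["Obama"] == 2        # undercounts: the true frequency is 5

# With resolution (the alias table is a hypothetical stand-in), every
# variant maps to one node and the frequency is correct.
ALIAS = {"obama": "Q76", "president obama": "Q76",
         "barry o": "Q76", "nobama": "Q76"}
resolved = Counter(ALIAS[m.lower()] for m in mentions)
assert resolved["Q76"] == 5     # one node, one correct count
```

Every downstream statistic – co-occurrence, centrality, training-data frequency – inherits whichever of those two counts you feed it.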
Keyword spotting just doesn't cut it. You need entity resolution.
Head of Operations at Ortelius, Transforming Data Complexity into Strategic Insights
6 months ago: Thanks for sharing Mike Dillinger, PhD. Your article started out a bit ambiguous about your standpoint (in my opinion), but that was resolved rather quickly. Do you see a way that graphs generated from unstructured text (equal to unresolved graphs?) will ever work? Or do they have to be created and curated by knowledge workers?
Founder & CEO | LLMs | ML/AI Training Data | Geopolitics | Geoeconomics | Speaker and Author | Patent Author
6 months ago: Exactly right. Language tech without synonyms (not sentiment!) misses meaning. The current use cases for this tech require input from subject matter experts who understand and can teach nuance to the machines. We did this years ago for a different purpose, but it positions our language data uniquely to deliver the most efficient & accurate training runs for the issues that we cover. Your post explains why. Thank you.
Exploring AI-driven Value | LLM Prompt Engineering
6 months ago: Just to clarify, does "entity-resolved graph" imply that we should eliminate irrelevant or illogical connections that might arise between nodes outside of system ontologies?
Author of 'Enterprise Architecture Fundamentals', Founder & Owner of Caminao
6 months ago: The term may be misleading as it suggests a modeling issue when in fact it's a knowledge one, namely how representations (e.g. graphs) can be attached to environments. And such attachment is not given (no truth in models ...) but depends on intents, and determines problem and solution spaces. https://caminao.blog/overview/knowledge-kaleidoscope/ea-complexity/