Entity Resolution: Priority #1 for Building Real Knowledge Graphs
I keep seeing mentions of "entity-resolved knowledge graphs", which leads me to believe that other so-called "knowledge" graphs don't resolve their entities. But a knowledge graph without entity resolution is like a beach under deep snow and dark skies. Yeah, you might optimistically want to call it a beach, but it's hardly usable as such.
Entity resolution is not a "nice to have" – it's essential for a reliable knowledge graph worthy of the name. That's why I call the graphs without entity resolution "counterfeit" knowledge graphs – they're simply sentences formatted as graphs with raw strings as nodes. The counterfeit ones are easy to build, but hard to make useful. Let's see why.
What is Entity Resolution?
Entity resolution is a key aspect of building the nodes in a graph from text – and of determining what those nodes contain. Nodes that only contain strings or labels are so impoverished that they force us to rely on semantic information from the user – they don't capture knowledge, so it's not directly accessible for manipulation by algorithms. These sketchy, uninformative nodes characterize what I call text graphs, since we haven't resolved the nodes to unique, richer concepts. Nodes that are equivalence classes of strings – collections of synonyms – usually also include a richer range of attributes as a machine-accessible definition, so they add a significant layer of value to generic graphs and start to make them knowledge graphs.
When we start from text data, entity resolution is a more general version of another well-known problem: pronoun resolution. It's hard to know which entity he (or another pronoun) actually means – it's a "wildcard" word – and there is both research and tooling to help algorithms make good, language-specific guesses. For entity resolution, instead of worrying only about pronouns, we also focus on names and noun phrases like those below – and try to figure out which entity they're talking about. We have to get the graph-building algorithm to decide which entity node in the graph a string maps to in each sentence context.
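As a sketch of that decision, consider a tiny resolver that maps a surface string to a set of candidate entity nodes, then uses word overlap with the sentence to pick one. Everything here is an illustrative assumption, not a real system: the `ALIASES` and `CONTEXT_HINTS` tables are made up, and the Wikidata-style IDs are used only as node labels.

```python
# A minimal sketch of mention-to-entity resolution, NOT a real system:
# the alias table and context hints are tiny, made-up samples.

ALIASES = {                        # surface string -> candidate entity nodes
    "barry o": {"Q76"},
    "obama": {"Q76", "Q13133"},    # Barack or Michelle Obama
    "the boss": {"Q76", "Q1225"},  # Obama's nickname, or Bruce Springsteen
}

CONTEXT_HINTS = {                  # entity node -> words that suggest it
    "Q76": {"president", "barack", "white", "house"},
    "Q13133": {"michelle", "first", "lady"},
    "Q1225": {"springsteen", "song", "album"},
}

def resolve(mention, sentence):
    """Map one mention string to the entity node its sentence context favors."""
    candidates = ALIASES.get(mention.lower())
    if not candidates:
        return None                # unknown string: no node to resolve to
    words = set(sentence.lower().split())
    # Score each candidate by overlap between its hint words and the sentence.
    return max(candidates, key=lambda node: len(CONTEXT_HINTS[node] & words))

print(resolve("the Boss", "The Boss played a song from his new album"))
# prints Q1225: the musical context points at the Springsteen node
```

The same string resolves differently in a different context: `resolve("the Boss", "the president spoke")` lands on Q76 instead. That context-sensitivity is exactly what raw-string nodes can't express.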
Both kinds of resolution are specialized cases of word sense disambiguation – trying to establish what a given instance of a string actually means, i.e., using evidence to "resolve" or assign it to one graph node or another. So entity resolution helps to solve a very, very serious problem for algorithms: humans have this weird habit of talking about the same exact entity in a gazillion different ways. How's an algo to know who or what they're talking about without some extra guidance?
People might say "Barry O", "President Obama", "my husband" (said by Michelle), "Dad" (for his kids), "Mr President" (in 2010), "the Boss", "Renegade" (for his Secret Service staff), "the 44th President", "the Idiot in Chief", "Odumbo", "Nobama", "me" (when he says it himself), "he" or "him" or "his", "Barry O'Bomber", "オバマ", "Обама", "Q76" (for Wikidata), "the guy sitting next to Michelle", "Malia's Dad", "Barack Hussein Obama II", (and many other sequences of characters) to refer to the same exact entity: Barack Obama!
So how much really is a gazillion? Well, when we include abbreviations, acronyms, initialisms, typos, synonyms, slang, transliterations, formal titles, pronouns, accents, alternative character encodings, mispronunciations, and nicknames for just one language, it's common to have many hundreds of distinct sequences of characters or sounds for one lonely entity. Multiply that by just 30 languages, and we're now faced with many thousands of ways of referring to the same single entity! Multiply that amount by the number of entities that we might want to find or mention in a complex domain and the problem quickly becomes one of processing millions of strings. And that's just to identify mentions of entities. We still haven't talked about identifying mentions of attributes, measures, other literal values, or relations yet. All that seems like a lot to me – enough to call it "a gazillion".
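To put rough numbers on "a gazillion", here's the arithmetic in a few lines, along with the inversion of a variant list into the lookup table a resolver actually needs. The per-language form count, language count, entity count, and the `VARIANTS` sample are all illustrative assumptions, not measurements; Q76 is the Wikidata ID mentioned above.

```python
# Back-of-the-envelope: how big is "a gazillion"? All numbers below are
# illustrative assumptions, not measurements.
forms_per_language = 300   # variants, typos, nicknames, titles, ...
languages = 30
entities = 1_000           # entities of interest in one complex domain
total_strings = forms_per_language * languages * entities
print(total_strings)       # nine million strings, just for entity mentions

# A tiny made-up sample of variants for one entity (Q76 = Barack Obama):
VARIANTS = {
    "Q76": ["Barack Obama", "President Obama", "Barry O", "Nobama",
            "the 44th President", "オバマ", "Обама"],
}
# Invert it into the direction a resolver needs: surface string -> node.
LOOKUP = {form.lower(): node for node, forms in VARIANTS.items()
          for form in forms}
assert LOOKUP["nobama"] == LOOKUP["barry o"] == "Q76"
```

Many strings, one node: that inverted table is the heart of the cheap, pre-processing-time version of entity resolution the next section argues for.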
Why bother?
What really happens if you don't resolve your entity mentions? In a nutshell, if you don't, then you're doing naïve keyword spotting and all of your frequencies from text data are simply wrong, whether it's for analytics, machine learning models, or the transformers in your LLM. So there are very real consequences of taking this very common shortcut:
If you don't resolve (or at least reduce) your entity mentions during pre-processing, then you need much more expensive (and still unreliable) computations later on to extract entity data. You will have to rely on some after-the-fact measure of similarity (like cosine distance) between the usual contexts of Obama and those of Nobama to make an unreliable guess about whether they're "similar" entities. But because those who used Nobama were critics of his policies, they will say very different things than the supporters who used Obama – and algorithms will conclude that they are different entities. Oops. Context isn't really meaning, it's context – that's why we have different words.
If you're working with text data and don't do reliable entity resolution, then your analytics and models are simply trash – all the more dangerous when a lightweight evaluation suggests that they look fine.
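The frequency claim is easy to see in a few lines. In this sketch the five-mention corpus and the `ALIAS` table are made up for illustration; the point is only that naive keyword counts fragment one entity's frequency across its surface forms, while resolved counts don't.

```python
from collections import Counter

# A made-up five-mention corpus; every string refers to the same person.
mentions = ["Obama", "President Obama", "Barry O", "Nobama", "Obama"]

# Naive keyword spotting: the entity's frequency fragments across
# surface forms, so no single count is right.
raw = Counter(mentions)
assert raw["Obama"] == 2        # undercounts: the true frequency is 5

# With resolution (the alias table is a hypothetical stand-in), every
# variant maps to one node and the frequency is correct.
ALIAS = {"obama": "Q76", "president obama": "Q76",
         "barry o": "Q76", "nobama": "Q76"}
resolved = Counter(ALIAS[m.lower()] for m in mentions)
assert resolved["Q76"] == 5     # one node, one correct count
```

Every downstream statistic – co-occurrence, centrality, training-data frequency – inherits whichever of those two counts you feed it.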
Keyword spotting just doesn't cut it. You need entity resolution.
Head of Operations at Ortelius, Transforming Data Complexity into Strategic Insights
6 months ago: Thanks for sharing Mike Dillinger, PhD. Your article started out a bit ambiguous about your standpoint (in my opinion), but that was resolved rather quickly. Do you see a way that graphs generated from unstructured text (equal to unresolved graphs?) will ever work? Or do they have to be created and curated by knowledge workers?
Founder & CEO | LLMs | ML/AI Training Data | Geopolitics | Geoeconomics | Speaker and Author | Patent Author
6 months ago: Exactly right. Language tech without synonyms (not sentiment!) misses meaning. The current use cases for this tech require input from subject matter experts who understand and can teach nuance to the machines. We did this years ago for a different purpose, but it positions our language data uniquely to deliver the most efficient & accurate training runs for the issues that we cover. Your post explains why. Thank you.
Exploring AI-driven Value | LLM Prompt Engineering
6 months ago: Just to clarify, does "entity-resolved graph" imply that we should eliminate irrelevant or illogical connections that might arise between nodes outside of system ontologies?
Author of 'Enterprise Architecture Fundamentals', Founder & Owner of Caminao
6 months ago: The term may be misleading as it suggests a modeling issue when in fact it's a knowledge one, namely how representations (e.g. graphs) can be attached to environments. And such attachment is not given (no truth in models ...) but depends on intents, and determines problem and solution spaces. https://caminao.blog/overview/knowledge-kaleidoscope/ea-complexity/