Superpowers of Knowledge Graphs, part 2: Data Aggregation

Knowledge graphs have superpowers above and beyond those of ontologies, taxonomies, and mere databases. They can't jump tall buildings in a single bound, but they do have something akin to x-ray vision and other powers. We talked about one such superpower in a previous post: data integration – across many sources (and in a later post about taming LLMs).

Another key superpower of knowledge graphs is data aggregation – across many instances.

Imagine you have millions or billions (or more!) of rows of data, each of which has many thousands of columns. Just as we can't see x-rays or infra-red energy, hear ultra-high frequency sounds, or visualize thousand-dimensional spaces, humans simply cannot identify or compare patterns in such gigantic data sets. We have to aggregate this data to make it useful and valuable – which is why the dark arts of Data Science are so essential today.

Takeaways for the hurried and harried

Choose your knowledge architecture carefully to enable the data aggregations that you need.

  • Ontologies offer little or no aggregation of individual instances: they focus on categories and often don't include instances, so they can't directly map instances to categories. Ontological categories, though, are very useful as targets for aggregation. Relating dogs to cats: fine. Dogs to letter carriers: fine. Fido to dogs: nope. Fido to LunaTheCat: nope.
  • Taxonomies offer very limited, lossy data aggregation: instances can be aggregated with only one relation (instanceOf) into only one parent category (per the messy MECE principle). Taxonomies only relate things to similar things: dogs to dogs, flowers to flowers, atoms to atoms. Relating dogs to fleas: nope. Dogs to letter carriers: nope. Dogs to cats: nope. Fido to LunaTheCat: nope. Aggregation with taxonomies yields labels.
  • Knowledge graphs enable robust, flexible, diverse, and less lossy aggregation: they take taxonomies and ontologies to the next level. Graphs include both individuals and categories and enable us to relate things to both similar and different things when we aggregate. Relating dogs to cats: fine. Fido to dogs: fine. Dogs to fleas: fine. Dogs to letter carriers: fine. Fido to LunaTheCat: fine. Aggregation with knowledge graphs yields more informative, expandable subgraphs.

Foundations of Data Aggregation

Data aggregation is based on the fundamental distinction between individual instances (like Jack Nicholson or the Eiffel Tower) and collections of these instances (variously called categories, classes, kinds, types, etc.) like people, actors, teams, etc. This distinction is so important for human understanding that it shows up in many different forms across basically all domains: item vs group, specimen vs species, reference vs sense, episodic memory vs semantic memory, token vs type, member vs team, particulars vs universals, and (in math) scalar value vs distribution, tuple vs relation, and element vs set.

Data aggregation boils down to summarizing the features of a collection or category based on the features of the individuals in it.

Simply stated, an individual instance is unique: its cardinality is 1 and only 1. There's only one Jack Nicholson and only one Eiffel Tower; they are unique instances of categories like actors or buildings. Measurement theory tells us that a single data point is the unique result of measuring some predicate (property, attribute, characteristic) in some unique instance at a specific time and place, with a specific method – recorded as a fact. So "data", technically, is a collection of facts with values for specific predicates about individual instances – a collection of unique measurements, the input for data aggregation. Ontologies downplay or ignore instances entirely, so they are most useful in getting ready for data aggregation – by defining the target categories that can guide the process. Not all taxonomies include instances (the ones without instances are sometimes called typologies), but when they do, they provide an important though very limited and lossy method for data aggregation – by relating each instance to a single category. Knowledge graphs can be seen as a significant extension to both ontologies and taxonomies: knowledge graphs often emphasize instances and diversify data aggregation by enabling each instance to relate to a wide range of categories through a range of different relations.
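
To make this concrete, here's a minimal sketch in Python of what a single measured fact might look like as a record. The field names and values are purely illustrative, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime

# One "fact": a single measurement of one predicate on one unique instance,
# at a specific time and place, with a specific method. (Illustrative names.)
@dataclass(frozen=True)
class Fact:
    instance: str       # e.g. "EiffelTower"
    predicate: str      # e.g. "heightInMeters"
    value: float
    measured_at: datetime
    location: str
    method: str

fact = Fact(
    instance="EiffelTower",
    predicate="heightInMeters",
    value=330.0,
    measured_at=datetime(2023, 6, 1, 12, 0),
    location="Paris",
    method="laser rangefinder",
)

# "Data", then, is just a collection of such unique measurements --
# the raw input for data aggregation.
data = [fact]
```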

A collection or category (I'll call them categories here), on the other hand, can have any cardinality, including 0 (we can even imagine and define a category that has no instances, like Martian actors). The output of data aggregation is values for features of categories. But we can't actually measure values for features of categories – we can only compute, predict, or infer them, based on the measured values for the instances in the category. In fact, basic statistics tells us that features of categories have different kinds of values from features of instances: because they describe whole categories, the values of category features need to be descriptors of distributions, such as measures of central tendency, dispersion, skewness, etc. Describing a distribution for a category feature with a default value or only its mean is a dramatic oversimplification that unintentionally increases the noise in the data – it's essentially self-inflicted error. It's also a simplification that is entirely inconsistent with logic, which assumes that an assertion like "adult male humans are 170 cm tall" is true for all adult male humans, even though means are, by definition, not true values for most of the instances in a given category.
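
As a rough illustration (the height values below are invented), here's how measured instance values might be aggregated into distribution descriptors for a category feature, rather than collapsed into a single default value:

```python
import statistics

# Measured heights (cm) for individual instances of the category
# "adult male human". The numbers are made up for illustration.
instance_heights = [162.0, 168.5, 170.2, 171.0, 174.8, 176.3, 181.1, 183.4]

# Category features get descriptors of the distribution of instance values,
# not a single "true" value that supposedly holds for every member.
category_height = {
    "mean": statistics.mean(instance_heights),
    "stdev": statistics.stdev(instance_heights),
    "min": min(instance_heights),
    "max": max(instance_heights),
}

# Asserting "adult male humans are 170 cm tall" keeps only the mean and
# throws away the dispersion: the self-inflicted error described above.
```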

Methods of Data Aggregation

It's helpful to think of two distinct families of data aggregation methods: quantitative aggregation and conceptual aggregation.

The most familiar quantitative methods for data aggregation are the descriptive statistical methods for comparing multiple instance feature values in terms of a single dimension, by computing a mean, standard deviation, kurtosis, etc. (i.e., category features) for those values – these are kinds of aggregation by computation. When we measure temperature, we map numbers to the feature, as in 32 degrees Celsius. If we take the temperature of 100 instances of a category, we can describe the temperature of the category by saying that the category feature's mean is 31.2 degrees Celsius and the category feature's standard deviation is 1.7 degrees. Describing categories with distributions gives us additional information about how typical the values for instance features are – which is useful for spotting exceptions, anomalies, and errors.
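
Here is a minimal sketch of this kind of aggregation by computation, using synthetic temperature values generated only to mirror the example above:

```python
import random
import statistics

random.seed(0)

# Simulated temperature measurements (deg C) for 100 instances of a category;
# the values are synthetic, chosen to echo the example in the text.
temperatures = [random.gauss(31.2, 1.7) for _ in range(100)]

# Aggregation by computation: describe the category feature as a distribution.
category_temperature = {
    "mean": statistics.mean(temperatures),    # roughly 31.2
    "stdev": statistics.stdev(temperatures),  # roughly 1.7
}

# The distribution also lets us flag atypical instances:
# exceptions, anomalies, and likely measurement errors.
mean, stdev = category_temperature["mean"], category_temperature["stdev"]
outliers = [t for t in temperatures if abs(t - mean) > 3 * stdev]
```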

We often generalize this approach to multiple dimensions by computing n means, standard deviations, etc. for the same category or cluster of instances – using multidimensional techniques of machine learning such as clustering and classifier algorithms: a centroid in clustering is essentially an array of means of many features, and it describes the "center" of a category of instances. Classifiers aggregate values by trying to determine the optimal boundaries (rather than the center) of the category.
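
Here's a rough sketch of the multidimensional case, assuming numpy and a handful of instances already assigned to one cluster; the feature values are invented for illustration:

```python
import numpy as np

# Feature vectors for instances assigned to one cluster/category:
# rows are instances, columns are features (all values illustrative).
cluster_members = np.array([
    [31.0, 12.5, 0.7],
    [30.4, 13.1, 0.9],
    [32.1, 11.8, 0.6],
    [31.6, 12.9, 0.8],
])

# The centroid is just an array of per-feature means: a multidimensional
# description of the "center" of the category.
centroid = cluster_members.mean(axis=0)

# Per-feature dispersion completes the description of the distribution.
per_feature_stdev = cluster_members.std(axis=0, ddof=1)
```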

Less familiar are the conceptual methods for aggregating instance features with non-numeric values, such as terms. When we "measure" or provide values for a feature like color, we use strings as values for the feature, like "green". When we "measure" the color of 100 similar objects, say apples, how can we find something like a mean? How can we describe the distribution of their values? One common way is to convert these non-numeric values into numbers by tallying the frequencies of the distinct values – regardless of how similar (or dissimilar) those values are in qualitative terms – and then to aggregate those counts statistically, as described above. (This ignores the conceptual difficulty of deciding which values we consider to be "the same" when we count them, since counting is categorization in disguise.)

Another way is to aggregate them conceptually – in terms of what they mean: group and count the greens, the reds, the other color values, for example. In a very useful sense, a parent term like "green" represents something like a "mean" concept that captures the commonalities of chartreuse, bottle green, kelly green, forest green, lime green, etc. (and more shades of green).
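
A small sketch of both approaches follows: tallying the raw strings versus grouping them under a parent term. The shade-to-color mapping is hypothetical, standing in for what a taxonomy or knowledge graph would normally provide.

```python
from collections import Counter

# "Measured" color values for some apples (strings, not numbers); illustrative data.
observed_colors = ["kelly green", "lime green", "forest green", "bottle green",
                   "candy apple red", "crimson", "lime green", "chartreuse"]

# Option 1: tally the distinct strings, regardless of how similar they are.
raw_frequencies = Counter(observed_colors)

# Option 2: aggregate conceptually by mapping each shade to a parent term
# (a hypothetical mapping -- in practice it comes from a taxonomy or graph).
parent_term = {
    "kelly green": "green", "lime green": "green", "forest green": "green",
    "bottle green": "green", "chartreuse": "green",
    "candy apple red": "red", "crimson": "red",
}
conceptual_frequencies = Counter(parent_term[c] for c in observed_colors)
# Counter({'green': 6, 'red': 2}): "green" acts like a "mean" concept that
# captures what its shades have in common.
```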

We've seen that ontologies usually don't include enough information about instances for this kind of aggregation by lookup, in which we store instance/category pairs as facts in lists, tables, trees, graphs, or other formats.

Taxonomies often do include instances and are the workhorses of conceptual data aggregation; in fact, they seem to have been invented for the specific purpose of aggregating specimens (instances) into species (categories) – aka classification – a clear example of aggregation by lookup. One key constraint, however, is that in taxonomies instances can be aggregated with only one relation (instanceOf) into only one parent category (per the messy MECE principle). In practice, though, taxonomy work is very difficult because instances can be aggregated in many, many possible ways, no single one of which is "best" for all use cases or can capture different perspectives or points of view. So forcing a single aggregation creates a single simplified representation that is very lossy and very difficult to build.
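
As a rough sketch (the instance and category names are illustrative), aggregation by lookup in a taxonomy might look like this. Note that all a lookup can return is a label, perhaps with its parent's label:

```python
# In a taxonomy, each instance maps to exactly one parent category
# via a single relation (instanceOf). Names are illustrative.
instance_of = {
    "Fido": "Dog",
    "LunaTheCat": "Cat",
    "Rex": "Dog",
}
sub_category_of = {
    "Dog": "Mammal",
    "Cat": "Mammal",
}

def taxonomy_lookup(instance: str) -> dict:
    """Aggregation by lookup: all we get back is a label (and its parent's label)."""
    parent = instance_of[instance]
    return {"category": parent, "grandparent": sub_category_of.get(parent)}

print(taxonomy_lookup("Fido"))  # {'category': 'Dog', 'grandparent': 'Mammal'}
```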

Knowledge graphs take conceptual data aggregation to the next level.

Knowledge graphs are not hobbled by the single-parent constraint common in taxonomies and some ontologies, so the same instance can be aggregated in many different ways. Even the much rarer poly-hierarchical taxonomies, in which multiple parent categories are permitted, still aggregate instances using only one type of relation (instanceOf). Knowledge graphs, on the other hand, can and often do include hundreds of instance-category relations, making it possible to aggregate the same instances in many, many ways – rather than just one. This dramatically reduces the lossiness that characterizes taxonomic representations because knowledge graphs can capture more and more varied information, adapt to more use cases, and encode more different user perspectives.

There are dramatic differences, too, in the information that data aggregation yields. When we aggregate by lookup in a taxonomy, for each instance the lookup returns a parent category label, sometimes with a human-readable definition from an external data dictionary, plus a label for the grandparent category. Category labels indicate that instances of a category are similar but do not document why (so we can't validate them at scale) – I call this similarity by decree. All the downstream algorithms have to work with is labels.

When we aggregate by lookup in a knowledge graph, for each instance the lookup returns a category label which is a pointer to a subgraph: i.e., to all of the triples in the graph that have this category as a subject. Category labels again indicate that instances of a category are similar, and the linked subgraph documents why they are considered similar – I call this similarity by description. This means much, much more information for the downstream algorithms to work with: rich subgraphs that can be expanded for more details as necessary.
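
Here's a minimal sketch of the difference, using a toy triple store with made-up names. The point is that the lookup returns a subgraph that documents why instances count as similar, and that subgraph can be expanded further on demand.

```python
# A toy knowledge graph as (subject, predicate, object) triples. Many different
# relations are allowed, not just instanceOf; all names are illustrative.
triples = [
    ("Fido", "instanceOf", "Dog"),
    ("Fido", "companionOf", "LunaTheCat"),
    ("Dog", "subCategoryOf", "Mammal"),
    ("Dog", "hasParasite", "Flea"),
    ("Dog", "chases", "LetterCarrier"),
    ("Dog", "typicalLifespanYears", "13"),
    ("Mammal", "subCategoryOf", "Animal"),
]

def subgraph_for(category):
    """All triples with this category as subject: the description of
    *why* its instances are considered similar."""
    return [t for t in triples if t[0] == category]

def kg_lookup(instance):
    """Aggregation by lookup that returns subgraphs, not bare labels."""
    categories = [o for s, p, o in triples if s == instance and p == "instanceOf"]
    return {c: subgraph_for(c) for c in categories}

fido = kg_lookup("Fido")           # {'Dog': [... four triples about Dog ...]}
expanded = subgraph_for("Mammal")  # expand further for more detail as needed
```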

Which specific characteristics give knowledge graphs their superpower for data aggregation?

  • Knowledge graphs return rich subgraphs as the output of data aggregation, not just labels.
  • Knowledge graphs include both individual instance and category nodes, as well as literal values, unlike ontologies, which usually include only category or class nodes.
  • Knowledge graphs include many kinds of instances and categories, unlike taxonomies, which usually focus only on the one type of entity described by the root node.
  • Knowledge graphs include many kinds of predicates, unlike taxonomies, which usually include only instanceOf and subcategoryOf predicates.
  • Knowledge graphs don't rely on additional external resources like data dictionaries or data catalogs (which are very often missing) to define instances and categories.
  • Knowledge graphs make instance and category definitions machine readable, not only human-readable, unlike taxonomies.
  • Knowledge graphs define instances and categories explicitly in terms of other defined entities – a kind of cognitive recursion, unlike other resources that define concepts in terms of only-human-interpretable strings.

Knowledge graphs have other superpowers, too. Which of them have you seen in practice?
Cedric Signori

Chief Marketing Officer at Equitus AI

1y

Thank you for sharing your Knowledge about knowledge :) ! Knowledge graphs are indeed superpowered tools in the world of data management and AI. Their ability to aggregate data across numerous instances provides invaluable insights and context. It's incredible how knowledge graphs can help your AI systems and data-driven processes soar to new heights by efficiently aggregating data from diverse sources. This power is crucial for making well-informed decisions and uncovering hidden patterns within complex data ecosystems. #dataaggregation #knowledgegraphs #knowledgemanagement #ontology #taxonomy #ai #artificialintelligence #EquitusAI

Roy Roebuck

Holistic Management Analysis and Knowledge Representation (Ontology, Taxonomy, Knowledge Graph, Thesaurus/Translator) for Enterprise Architecture, Business Architecture, Zero Trust, Supply Chain, and ML/AI foundation.

1y

I agree with this posting. Again, this aligns with my own KG approach, as described in my comment to your Part 1.

Mark Spivey

Helping us all "Figure It Out" (Explore, Describe, Explain), many Differentiations + Integrations at any time .

1y

people mostly use these words in varying ways to make specific points … ironically there is very little consistency in the definitions of knowledge graphs, ontologies, taxonomies, databases, etc …

Maciej Teska

CEO of Synergy Codes | 170+ visual solutions | data visualization & diagramming 3x faster

1y

Absolutely true! Knowledge graphs are like the superheroes of data management, enabling seamless integration and aggregation of diverse data sources, paving the way for enhanced data insights and decision-making.

Ashleigh N. Faith

Knowledge Graph, Semantic Search, & MLAI

1y

Ha, reading quickly while scrolling I thought this said data agitation
