Superpowers of Knowledge Graphs, part 2: Data Aggregation
Knowledge graphs have superpowers above and beyond those of ontologies, taxonomies, and mere databases. They can't leap tall buildings in a single bound, but they do have something akin to x-ray vision and other powers. We talked about one such superpower – data integration across many sources – in a previous post (and again in a later post about taming LLMs).
Another key superpower of knowledge graphs is data aggregation – across many instances.
Imagine you have millions or billions (or more!) of rows of data, each of which has many thousands of columns. Just as we can't see x-rays or infrared energy, hear ultra-high frequency sounds, or visualize thousand-dimensional spaces, humans simply cannot identify or compare patterns in such gigantic data sets. We have to aggregate this data to make it useful and valuable – which is why the dark arts of Data Science are so essential today.
Takeaways for the hurried and harried
Choose your knowledge architecture carefully to enable the data aggregations that you need.
Foundations of Data Aggregation
Data aggregation is based on the fundamental distinction between individual instances (like Jack Nicholson or the Eiffel Tower) and collections of these instances (variously called categories, classes, kinds, types, etc.) like people, actors, teams, etc. This distinction is so important for human understanding that it shows up in many different forms across basically all domains: item vs group, specimen vs species, reference vs sense, episodic memory vs semantic memory, token vs type, member vs team, particulars vs universals, and (in math) scalar value vs distribution, tuple vs relation, and element vs set.
Data aggregation boils down to summarizing the features of a collection or category based on the features of the individuals in it.
Simply stated, an individual instance is unique: its cardinality is 1 and only 1. There's only one Jack Nicholson and only one Eiffel Tower; they are unique instances of categories like actors or buildings. Measurement theory tells us that a single data point is the unique result of measuring some predicate (property, attribute, characteristic) of some unique instance at a specific time and place, with a specific method – recorded as a fact. So "data", technically, is a collection of facts with values for specific predicates about individual instances – a collection of unique measurements, the input for data aggregation.

Ontologies downplay or ignore instances entirely, so they are most useful in getting ready for data aggregation – by defining the target categories that can guide the process. Not all taxonomies include instances (the ones without instances are sometimes called typologies), but when they do, they provide an important though very limited and lossy method for data aggregation – by relating each instance to a single category. Knowledge graphs can be seen as a significant extension of both ontologies and taxonomies: knowledge graphs often emphasize instances and diversify data aggregation by enabling each instance to relate to a wide range of categories through a range of different relations.
A collection or category (I'll call them categories here), on the other hand, can have any cardinality, including 0 (we can even imagine and define a category that has no instances, like Martian actors). The output of data aggregation is values for features of categories. But we can't actually measure values for features of categories – we can only compute, predict, or infer them, based on the measured values for the instances in the category. In fact, basic statistics tells us that features of categories take different kinds of values than features of instances: because they describe whole collections, the values of category features need to be descriptors of distributions, such as measures of central tendency, dispersion, skewness, etc. Describing a distribution for a category feature with a default value or only its mean is a dramatic oversimplification that unintentionally increases the noise in the data – it's essentially self-inflicted error. It's also a simplification that is entirely inconsistent with logic, which assumes that an assertion like "adult male humans are 170 cm tall" is true for all adult male humans, even though means are, by definition, not true values for most of the instances in a given category.
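To make the contrast concrete, here is a minimal Python sketch of describing a category feature as a distribution rather than a single number. The height values are invented for illustration:

```python
import statistics

# Hypothetical height measurements (cm) for instances of the category
# "adult male humans" -- the values are invented for illustration.
heights = [162.0, 168.5, 169.9, 170.0, 170.1, 171.2, 173.4, 175.0, 178.8, 181.3]

mean = statistics.mean(heights)      # central tendency
stdev = statistics.stdev(heights)    # dispersion
n = len(heights)
# A simple moment-based estimate of skewness (asymmetry of the distribution).
skew = sum((x - mean) ** 3 for x in heights) / (n * stdev ** 3)

print(f"mean={mean:.1f} cm, stdev={stdev:.1f} cm, skewness={skew:.2f}")
# Asserting only "adult male humans are 170 cm tall" keeps the mean and
# silently discards the dispersion and skewness -- the self-inflicted
# error described above.
```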
Methods of Data Aggregation
It's helpful to think of two distinct families of data aggregation methods: quantitative aggregation and conceptual aggregation.
The most familiar quantitative methods for data aggregation are the descriptive statistical methods for comparing multiple instance feature values in terms of a single dimension, by computing a mean, standard deviation, kurtosis, etc. (i.e., category features) for those values – these are kinds of aggregation by computation. When we measure temperature, we map numbers to the feature, as in 32 degrees Celsius. If we take the temperature of 100 instances of a category, we can describe the temperature of the category by saying that the category feature's mean is 31.2 degrees Celsius and the category feature's standard deviation is 1.7 degrees. Describing categories with distributions gives us additional information about how typical the values for instance features are – which is useful for spotting exceptions, anomalies, and errors.
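As a rough Python sketch of this – the readings below are simulated with a random generator, not real measurements – quantitative aggregation and the resulting anomaly-spotting might look like:

```python
import random
import statistics

random.seed(42)
# Hypothetical temperature readings (degrees Celsius) for 100 instances.
readings = [random.gauss(31.2, 1.7) for _ in range(100)]

mu = statistics.mean(readings)
sigma = statistics.stdev(readings)
print(f"category feature: mean = {mu:.1f} C, stdev = {sigma:.1f} C")

# The distribution tells us how typical an instance value is, so we can
# flag likely exceptions, anomalies, and errors.
anomalies = [t for t in readings if abs(t - mu) > 3 * sigma]
print(f"{len(anomalies)} reading(s) more than 3 standard deviations from the mean")
```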
We often generalize this approach to multiple dimensions by computing n means, standard deviations, etc. for the same category or cluster of instances – using multidimensional techniques of machine learning such as clustering and classifier algorithms: a centroid in clustering is essentially an array of means of many features and it describes the "center" of a category of instances. Classifiers aggregate values by trying to determine the optimal boundaries (rather than the center) of the category.
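A minimal sketch of the centroid idea, with made-up feature values:

```python
# Each instance is a point in feature space: (temperature, mass, height).
# The values below are invented for illustration.
instances = [
    (31.0, 2.1, 10.5),
    (32.5, 1.9, 11.0),
    (30.8, 2.3, 10.1),
]

def centroid(points):
    """An array of per-feature means: the 'center' of a category of instances."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

print(centroid(instances))  # one mean per feature, i.e. n means at once
```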
Less familiar are the conceptual methods for aggregating instance features with non-numeric values, such as terms. When we "measure" or provide values for a feature like color, we use strings as values for the feature, like "green". When we "measure" the color of 100 similar objects, say apples, how can we find something like a mean? How can we describe the distribution of their values? One common way is to convert these non-numeric values into numbers and to tally the frequencies of the distinct values – regardless of how similar (or dissimilar) those values are in qualitative terms – and aggregate them statistically, as described above. (This ignores the conceptual difficulty of deciding which values we consider to be "the same" when we count them, since counting is categorization in disguise.)
Another way is to aggregate them conceptually – in terms of what they mean: group and count the greens, the reds, and the other color values, for example. In a very useful sense, a parent term like "green" represents something like a "mean" concept that captures the commonalities of chartreuse, bottle green, kelly green, forest green, lime green, and many other shades of green.
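A minimal sketch of this kind of conceptual roll-up; the shade-to-parent mapping below is hypothetical, and in practice would come from a controlled vocabulary, thesaurus, or the graph itself:

```python
from collections import Counter

# Hypothetical shade-to-parent-term mapping (toy data for illustration).
parent_term = {
    "chartreuse": "green", "kelly green": "green", "forest green": "green",
    "lime green": "green", "bottle green": "green",
    "crimson": "red", "scarlet": "red",
}

observed = ["chartreuse", "forest green", "crimson", "lime green", "kelly green"]

# Roll each specific value up to its parent term, then count the groups.
counts = Counter(parent_term.get(shade, "other") for shade in observed)
print(counts)  # Counter({'green': 4, 'red': 1})
```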
We've seen that ontologies usually don't include enough information about instances for this kind of aggregation by lookup, in which we store instance/category pairs as facts in lists, tables, trees, graphs, or other formats.
Taxonomies often do include instances and are the workhorses of conceptual data aggregation; in fact, they seem to have been invented for the specific purpose of aggregating specimens (instances) into species (categories) – aka classification – a clear example of aggregation by lookup. One key constraint, however, is that in taxonomies instances can be aggregated with only one relation (instanceOf) into only one parent category (per the messy MECE principle). In practice, though, taxonomy work is very difficult because instances can be aggregated in many, many possible ways, no single one of which is "best" for all use cases or can capture different perspectives or points of view. So forcing a single aggregation creates a single simplified representation that is very lossy and very difficult to build.
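As a sketch, a taxonomy's aggregation by lookup amounts to little more than a single-valued dictionary – the instance/category pairs below are toy data:

```python
# In a strict taxonomy each instance has exactly one parent category,
# so a plain dict of instance/category pairs is enough -- which is
# precisely the limitation.
instance_of = {
    "Jack Nicholson": "actor",
    "Meryl Streep": "actor",
    "Eiffel Tower": "building",
}

def aggregate_by_lookup(instances):
    """Group instances under their single parent category."""
    groups = {}
    for inst in instances:
        groups.setdefault(instance_of[inst], []).append(inst)
    return groups

print(aggregate_by_lookup(["Jack Nicholson", "Meryl Streep", "Eiffel Tower"]))
# {'actor': ['Jack Nicholson', 'Meryl Streep'], 'building': ['Eiffel Tower']}
```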
Knowledge graphs take conceptual data aggregation to the next level.
Knowledge graphs are not hobbled by the single-parent constraint common in taxonomies and some ontologies, so the same instance can be aggregated in many different ways. Even the much rarer poly-hierarchical taxonomies, in which multiple parent categories are permitted, still aggregate instances using only one type of relation (instanceOf). Knowledge graphs, on the other hand, can and often do include hundreds of instance-category relations, making it possible to aggregate the same instances in many, many ways – rather than just one. This dramatically reduces the lossiness that characterizes taxonomic representations because knowledge graphs can capture more and more varied information, adapt to more use cases, and encode a wider range of user perspectives.
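A minimal sketch of the difference, with a toy triple list and hypothetical relation names – the same instances can be grouped along any relation, not just instanceOf:

```python
from collections import defaultdict

# A toy knowledge graph as (subject, relation, object) triples. The same
# instance participates in several instance-category relations, so it can
# be aggregated along any of them.
triples = [
    ("Jack Nicholson", "instanceOf", "actor"),
    ("Jack Nicholson", "wonAward", "Academy Award"),
    ("Jack Nicholson", "bornIn", "New Jersey"),
    ("Meryl Streep", "instanceOf", "actor"),
    ("Meryl Streep", "wonAward", "Academy Award"),
    ("Meryl Streep", "bornIn", "New Jersey"),
]

def aggregate_by(relation):
    """Group instances by the object of the chosen relation."""
    groups = defaultdict(list)
    for s, r, o in triples:
        if r == relation:
            groups[o].append(s)
    return dict(groups)

print(aggregate_by("instanceOf"))  # one aggregation...
print(aggregate_by("bornIn"))      # ...and a different, equally valid one
```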
There are dramatic differences, too, in the information that data aggregation yields. When we aggregate by lookup in a taxonomy, for each instance the lookup returns a parent category label, sometimes with a human-readable definition from an external data dictionary, plus a label for the grandparent category. Category labels indicate that instances of a category are similar but do not document why (so we can't validate them at scale) – I call this similarity by decree. All the downstream algorithms have to work with is labels.
When we aggregate by lookup in a knowledge graph, for each instance the lookup returns a category label that is a pointer to a subgraph: i.e., to all of the triples in the graph that have this category as a subject. Category labels again indicate that instances of a category are similar, and the linked subgraph documents why they are considered similar – I call this similarity by description. This means much, much more information for the downstream algorithms to work with: rich subgraphs that can be expanded for more details as necessary.
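A sketch of similarity by description, again with invented triples: the lookup returns the describing subgraph rather than a bare label:

```python
# Toy triples describing the category itself (invented for illustration).
graph = [
    ("actor", "subclassOf", "person"),
    ("actor", "performsIn", "films"),
    ("actor", "hasSkill", "acting"),
    ("Jack Nicholson", "instanceOf", "actor"),
]

def describe(category):
    """Return every triple whose subject is the category -- the subgraph
    that documents *why* its instances count as similar."""
    return [t for t in graph if t[0] == category]

for triple in describe("actor"):
    print(triple)
```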
Which specific characteristics give knowledge graphs their superpower for data aggregation?
Knowledge graphs have other superpowers, too. Which of them have you seen in practice?