The Role of Data in the Context of the Enterprise
Copyright 2024 Kurt Cagle / The Cagle Report
I recently wrote a report article about context's role in data (which I should eventually capture, I suppose). It ended up prompting a question from ontologist Robert Vane, to wit: “What is the role that data plays in the context of business?”
This is in many respects a very different question, especially if you look at it from the standpoint of “business context”. Put another way, why do we put so much effort into building elaborate data systems in organizations, and is this, in fact, a worthwhile investment?
To help put things into perspective, it’s worth rethinking what exactly we mean by data, which is made particularly difficult because there are so many different things that we describe as being data, some with more validity than others.
Most organizations enumerate what is important to them—their products, people, facilities, processes, organizational divisions, and similar entities—and then provide properties that describe both the metric characteristics and the relationships between these entities.
Yet, at the same time, businesses in general do not think of information in this way, if they think about it at all. The focus there has typically been on solving a few key problems:
These are the business constraints and the business context that every CEO has to deal with daily. It’s worth noting that from this perspective, data enters into these constraints only secondarily, and then only as a means to an end, not the end itself.
The Business Importance of Data
Is data important here? Sort of, but only along the edges. From the business perspective, data is simply another form of record keeping, and all of the discussion about relational databases vs. document databases vs. semantic databases vs. data lakes is a distraction. The business leader wants answers to a few key questions:
This is a perspective that I suspect gets lost once you get into the bowels of the modern IT-led data-centric business. There the questions are very different:
This approach shows an awareness of data within the organization, but often at the cost of losing the overall context of the business of that organization.
Engaging Warp Drive
During the early 2000s, as IT was flexing its muscles, the concept of the Data-Driven organization began to take hold. The mantra that emerged was a simple one: If you, as a company manager, make data the priority of your company, this will increase visibility into how the company is doing in meeting its fiscal goals.
This ended up manifesting itself in a number of initiatives:
Anyone who has seen any of the multiple Star Trek movies, shows, etc., understands what this means. The Captain sits in his or her chair, while each of the department heads sits at a workstation with several terminals, each showing off a different part of the system – communications, navigation, weapon systems, security, etc. When the big baddie shows up, one of those people turns around and fiddles with dials (did you notice there were no keyboards on those terminals?) and either alerts the captain that there’s a problem or, well, alerts the captain that there’s a problem.
Now, it’s probably accidental that this starship has always been called Enterprise, but it’s nonetheless appropriate. Star Trek has always been a metaphor for corporate business organizations. The Bridge is the command and control (C&C or CnC) centre that has been part of every ship built since the late 19th century, and that CnC mentality informed the nature of corporations after World War II.
Yet at the same time, there has always been a silliness to the Bridge that most people likely have not considered:
Save for that last issue, the point that I’m trying to make here is that this viewpoint of the enterprise reduces everything to a mechanistic vision that could (and indeed probably should) be automated.
Would I do it with today’s AI? No, probably not, but that’s partly because most people bring more than just their job description to their jobs, and in part because what we call AIs today are simply very expensive search engines. We forget that at our peril.
The Importance of Global Identifiers
The Star Trek metaphor aside, Enterprise data is itself an interesting concept. An enterprise ontology can be thought of as the concepts that are common to all aspects of the organization and, as such, should be shared throughout the organization.
If your company publishes books, then the catalogue of books is a good example of such an ontology. Each book has a unique identifier (such as an ISBN) that indicates which title it is, and another identifier (typically a SKU or serial number) that indicates which copy of that title it is.
Your organization wants to keep track of both pieces of information as well as the associated metadata. The ISBN will not tell you how many book copies have been sold, but by counting the SKUs sold, you have a rough estimate of that count.
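To make that distinction concrete, here is a minimal Python sketch (the ISBNs, SKUs, and field names are invented purely for illustration): the ISBN names the title, the SKU names the physical copy, and counting distinct SKUs per ISBN yields the rough sales estimate described above.

    from collections import defaultdict

    # Each sale record carries both the title-level identifier (ISBN) and the
    # copy-level identifier (SKU). All values here are invented for illustration.
    sales = [
        {"isbn": "978-0-00-000000-1", "sku": "CP-0001"},
        {"isbn": "978-0-00-000000-1", "sku": "CP-0002"},
        {"isbn": "978-0-00-000000-2", "sku": "CP-0003"},
    ]

    copies_sold = defaultdict(set)
    for sale in sales:
        copies_sold[sale["isbn"]].add(sale["sku"])

    for isbn, skus in copies_sold.items():
        # The ISBN alone says nothing about volume; counting distinct SKUs does.
        print(isbn, "copies sold:", len(skus))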
The challenge that most organizations have with such data is key management. More specifically, without a way to uniquely identify a given resource, that resource cannot be managed. However, once a resource is given an identifier that can be verified, it is possible to correlate that identifier with other identifiers (to create a keychain). This is, in fact, how OAuth works – when you log into a particular system, that system can be used to validate you to another system with its identifiers. The second system doesn’t know the first system’s identifier, but it can create its own identifier knowing that there is a validator that can verify that the person is who they say they are.
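As a rough illustration of such a keychain (not a description of any particular OAuth implementation, and with all identifiers invented), a single internal identifier can be correlated with the identifiers that other systems have verified:

    # A minimal sketch of a "keychain": one internal identifier correlated with
    # identifiers verified by other systems. Every value below is invented.
    keychain = {
        "urn:example:person/kcagle": {
            "hr_system": "EMP-10442",
            "email_provider": "kurt@example.com",
            "sso_subject": "sub-83f1c2",  # opaque subject issued by a validator
        }
    }

    def resolve(system: str, external_id: str) -> str | None:
        """Return the internal identifier correlated with an external one, if any."""
        for internal_id, ids in keychain.items():
            if ids.get(system) == external_id:
                return internal_id
        return None

    print(resolve("hr_system", "EMP-10442"))  # urn:example:person/kcagle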
The growth of biometrics – fingerprints, visual and voice recognition systems, and so forth – as authenticators over the last couple of decades has made it possible for a system to identify a person passively, without needing permission to track them. This is the foundation of the surveillance economy, which can be completely transactionless, but it also enables potential bad actors – governments, corporations, marketers, politicians, religious figures, criminals and so forth – to track people to a disturbingly intrusive degree. I have much more to say on this, but not here.
Enterprise ontologies and their corollary, digital twins, are not possible without this level of key management, but with it, they become preferable to multiple siloed relational data systems. Knowledge graphs typically contain passive keychains, where verified identifiers are collected. Any such identifier is perforce stochastic in nature, meaning that there is usually a probability greater than zero that the identifier is not, in fact, correlated with the individual in question, but that has always been the case with identity management.
Leaving the question of key security for another time as well, once you can correlate a key to a given real-world resource, then an enterprise knowledge graph makes a great deal of sense, but it also changes the relationship of an organization to its data. Specifically, until comparatively recently, a database was a passive entity that was usually tied into a specific application, and it was the application that received the primary focus in the organization. This makes a certain sense – the application is what people traditionally interact with.
On the other hand, with an enterprise knowledge graph, data becomes a shared resource, and multiple applications end up using that knowledge graph as their primary data source and target. This approach has advantages and disadvantages.
The data model becomes much more fluid, requiring more management over time. Once you have an identifier, you can attach arbitrary metadata from different sources to that identifier. This means that knowledge graphs become more valuable over time, as more connections get added, at relatively little additional cost in complexity. This is why, even as AI seems to have swallowed the data space, knowledge graphs seem to hold their own.
Knowledge Graphs and Business Context
Knowledge Graphs differ significantly from relational databases, in part because they treat their schemas as part of their graph of information and in part because they utilize an open-world assumption. This latter point probably doesn’t matter as much to non-data people, but it differentiates knowledge graphs from almost all other information storage systems.
The open-world assumption is a direct consequence of using universal global identifiers. In a relational database, a key is usually just a number stored in a reference table that identifies a given row in a table in that database. If I create a new database, even if it is otherwise identical in schema, the identification keys will almost invariably differ. Lookups are fast because of this (comparing an integer is much faster than comparing a string), but the performance gain comes at the cost of the ability to talk about data globally.
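A small sketch of that contrast, with all keys, names, and URIs invented: local integer keys only mean something inside their own database, while a shared URI lets independent stores make statements about the same resource.

    # Two otherwise-identical relational databases assign different local keys
    # to the same real-world author:
    db_a_authors = {101: "Ursula K. Le Guin"}
    db_b_authors = {57: "Ursula K. Le Guin"}
    # Key 101 in db_a and key 57 in db_b cannot be compared directly; the
    # numbers only carry meaning inside their own database.

    # With a global identifier, both stores use the same URI, so statements
    # made anywhere about that URI are statements about the same resource:
    AUTHOR_URI = "https://example.org/id/person/ursula-k-le-guin"
    store_a = {AUTHOR_URI: {"birthYear": 1929}}
    store_b = {AUTHOR_URI: {"notableWork": "The Dispossessed"}}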
On the other hand, if you reference a resource in a knowledge graph, that resource has a unique identifier, regardless of the database, which means that you can serialize the contents of that knowledge graph from one database and pour it into a different one, while maintaining the same graph. The graph becomes independent of the storage medium, and information contained in one graph can thus be referenced in other graphs without that information needing to be present. (There are exceptions to this principle in using anonymous or blank nodes, but that’s a discussion for another time as well).
What this means in practice: different information about a resource may reside in different knowledge graphs in different databases, but so long as I have such a global identifier (called a uniform resource identifier, or URI), I can query a database with a given URI and retrieve back what that system knows about the resource, whether it be a person, a product, an event, or anything else. Moreover, I can aggregate that information into a single record that reflects what is known about that resource at any given time without having to make a copy somewhere else.
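Here is a minimal sketch of that aggregation using the Python rdflib library, with the URIs and vocabulary invented for illustration: two independently maintained graphs each hold part of the picture, and merging them yields everything known about the shared URI without copying records into a master system.

    from rdflib import Graph, URIRef

    # Two independently maintained graphs that each know something about the
    # same resource, identified by the same (invented) URI.
    PERSON = URIRef("https://example.org/id/person/jane-doe")

    hr_graph = Graph().parse(data="""
        @prefix ex: <https://example.org/vocab/> .
        <https://example.org/id/person/jane-doe> ex:department "Logistics" .
    """, format="turtle")

    crm_graph = Graph().parse(data="""
        @prefix ex: <https://example.org/vocab/> .
        <https://example.org/id/person/jane-doe> ex:accountManagerFor <https://example.org/id/org/acme> .
    """, format="turtle")

    # Aggregating what is known about the resource is just a union of graphs;
    # nothing needs to be consolidated into a canonical record first.
    merged = hr_graph + crm_graph
    for _, predicate, value in merged.triples((PERSON, None, None)):
        print(predicate, value)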
Of course, one consequence is that if there is misinformation in the system – a given person was born in 1963 in some databases but in 1964 in others – then this will readily surface. Schemas can be defined as data within the knowledge graph that allow you to validate that information (in this case, by observing that what should be a single-valued property is appearing as an array of values) and, from that, help you correct the misinformation. This is not doable within a relational database, though it is within document stores in XML, JSON, or other structured formats – assuming you have global URIs.
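A toy version of that validation step, again using rdflib with invented URIs (a production system would more likely express the constraint in SHACL): a property the schema declares single-valued should have exactly one value per subject, so any subject with more than one value surfaces the conflict.

    from collections import Counter
    from rdflib import Graph, URIRef

    # The same person asserted with two different birth years by two sources.
    g = Graph().parse(data="""
        @prefix ex: <https://example.org/vocab/> .
        <https://example.org/id/person/jane-doe> ex:birthYear 1963, 1964 .
    """, format="turtle")

    BIRTH_YEAR = URIRef("https://example.org/vocab/birthYear")

    # A property declared single-valued should have exactly one value per
    # subject; more than one flags the conflict for correction.
    counts = Counter(s for s, _, _ in g.triples((None, BIRTH_YEAR, None)))
    for subject, n in counts.items():
        if n > 1:
            print(f"Conflict: {subject} has {n} values for birthYear")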
There are other implications here as well, but going back to the bigger picture of context, this helps to solve many of the thornier questions that emerge from the executive’s standpoint. Taking some of these from earlier:
How do I make the production of goods more efficient, and secondarily, more flexible, while minimizing the costs of doing so up front?
By creating universal URIs early in the creation process of a good, you can store that URI and then add to it as the good evolves. This makes your organization independent of any one data system, and you spend far less time dealing with the cost of integration.
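As a sketch of what minting the URI early might look like in practice (the vocabulary and values are invented, and rdflib is just one convenient way to express it), each later stage of the product’s life simply adds statements about the same identifier:

    import uuid
    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("https://example.org/vocab/")  # invented vocabulary
    graph = Graph()

    # Mint a URI for the product as soon as it is conceived, before any single
    # system "owns" it; every later stage adds statements about the same URI.
    product = URIRef(f"https://example.org/id/product/{uuid.uuid4()}")

    graph.add((product, EX.designedBy, Literal("Product Team A")))
    # ... later, manufacturing and finance each contribute their own facts:
    graph.add((product, EX.manufacturedAt, Literal("Plant 7")))
    graph.add((product, EX.unitCost, Literal(12.40)))

    print(graph.serialize(format="turtle"))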
How do I ensure that my people are assigned to and adequately performing tasks that enhance the bottom line while reducing costs?
Such knowledge graphs can be distributed so that things such as tracking in-progress and completed activities can be done independently from other applications while creating a history relevant to that particular employee. In a data services world, doing so is relatively easy today compared to a decade ago.
How do I ensure that I am in compliance with regulations while still making a profit?
Regulatory compliance is hard because there are so many independent systems that have to be managed. A regulation is ultimately a set of rules, and those rules can be encoded and applied to data as it is produced. That way, any time a non-compliant activity occurs, it becomes possible to capture it and notify the people responsible for ensuring that the process is brought back into compliance. Again, this is far easier (and much less expensive) than trying to build forensic systems that attempt to determine non-compliance after the fact.
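A minimal sketch of a regulation encoded as a rule and applied at the point data is produced, rather than reconstructed forensically afterwards (the field names and thresholds are invented; a real system would derive its rules from the regulation itself):

    def check_shipment(record: dict) -> list[str]:
        """Return a list of compliance violations for one shipment record."""
        violations = []
        if record.get("hazard_class") and not record.get("hazard_declaration"):
            violations.append("hazardous goods shipped without a declaration")
        if record.get("declared_weight_kg", 0) > 30000:
            violations.append("declared weight exceeds permitted maximum")
        return violations

    shipment = {"id": "SHP-001", "hazard_class": "3", "declared_weight_kg": 31000}
    for problem in check_shipment(shipment):
        # In practice this would notify the people responsible for remediation.
        print(f"{shipment['id']}: {problem}")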
How do I ensure no disruptions in my supply chain that might give my competitors an advantage while cutting into my profits?
A supply chain is a complex, distributed system, often dependent upon hundreds of vendors, each working independently. Simply by associating a URI with a given resource (such as a shipping container), you can track the evolution of the container and its contents. This process is already well underway, but by agreeing to use the same URI as the originator of the transit container, you can similarly track the logistics internally, again at a much lower cost than having to build out an independent supply chain management system. This is an excellent example of a digital twin, where what you’re replicating is not a thing but a process, adding appropriate metadata to tie it into your own processes.
There are many other examples, where something as simple as going to a URI rather than a local identifier can vastly increase the information you have about the world at a very minimal overall cost.
Knowledge Graphs and AI
Any such discussion about data and context, in this day and age, would not be complete without at least some mention of AI, specifically large language models (LLMs). What is usually called AI today is neural-network-based AI, specifically AIs used as tools to generate memetic spaces (which is how I think of an LLM). The other side of AI is Symbolic AI, which aims to encapsulate reasoning through manipulating symbols, typically within graphs.
A neural network is a graph, though a memetic space, by itself, is not. Memetic spaces work by clustering information based on linguistic patterns, and it is likely that such clustering occurs in our brains as well, though not in the same way. When you encode a memetic space in this way, you are likely creating a space-filling fractal structure (an iterated function system, or IFS).
Without getting into a lot of fairly gnarly math, what that means in practice is that when a model is trained on documents, closeness is usually determined by the similarity of passages (chunks) to one another in terms of word usage patterns. This is actually how many search systems determine relevance between documents, but it’s a blunt hammer, especially when you have documents that may have few terms in common but still have conceptual overlap.
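A small sketch of that blunt hammer using scikit-learn’s TF-IDF vectorizer (the sentences are invented for illustration): passages are scored by shared word usage, so a passage that says much the same thing in different vocabulary scores as less similar than its meaning warrants.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    passages = [
        "The knowledge graph assigns a global identifier to every resource.",
        "Every resource in the knowledge graph carries a global identifier.",
        "Each entity gets a URI that is unique across all systems.",
    ]

    # The first two passages share most of their terms; the third overlaps
    # conceptually but shares almost none, so it scores as far less similar.
    tfidf = TfidfVectorizer().fit_transform(passages)
    print(cosine_similarity(tfidf))  # pairwise similarity matrix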
This ambiguity causes hallucinations in large language models, as the incoming prompt may not have enough information to find a meaningful cluster, causing occasionally random content to pop up. You can cut this down by reducing the temperature (the amount of randomness allowed when sampling from the vector space), but this also comes at the expense of losing potentially meaningful conceptual information.
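For readers who want the mechanics, temperature usually enters where the model’s raw scores are turned into a sampling distribution. This NumPy sketch (with invented scores) shows how lowering the temperature concentrates probability on the top candidates, cutting down on stray content at the cost of variety.

    import numpy as np

    def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
        """Turn raw scores into a sampling distribution; lower temperature
        concentrates probability on the top candidates, higher temperature
        spreads it out (and with it, the chance of off-topic content)."""
        scaled = logits / temperature
        exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
        return exp / exp.sum()

    logits = np.array([3.0, 2.5, 0.5, 0.1])  # invented scores for four candidate tokens
    print(softmax_with_temperature(logits, temperature=1.0))
    print(softmax_with_temperature(logits, temperature=0.2))  # much more deterministic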
There’s an active debate within the AI community between the symbolic and neural network proponents, with a growing consensus that if you build a system that takes advantage of semantic graphs to help “train” the LLM, you are more likely to extract connections that may not ordinarily occur in documents but that are nonetheless important for reasoning.
The other aspect of LLMs that is just as critical is that they are most useful when they have a certain minimum set of training documents (often called a corpus), and when that corpus has a great deal of variety. In other words, a certain level of hallucination seems to be necessary for LLMs to function. This creates a trade-off – the larger the number of sources in the corpus, the longer it takes to train, and while in-memory models do retain historical information, when you retrain an LLM, you lose much of that learned knowledge. This means that, barring some major architectural advancements, LLMs are simply not transactional enough to serve as databases by themselves.
However, if LLMs are used as a mid-tier process talking with knowledge graphs, and if enough metadata is stored in the LLM during training to provide hooks into the conceptual memetic space to provide scaffolding, what emerges is a kind of a synthetic system that has both a knowledge graph and LLM component to it. I anticipate that as this concept matures (it’s very immature at the moment), it will likely have a profound impact on our whole practice of data management.
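One very rough sketch of that mid-tier pattern, with rdflib standing in for the knowledge graph and the actual LLM call deliberately left out (the function, vocabulary, and URIs are hypothetical): the graph supplies verified facts about a resource, and those facts become grounding context in the prompt handed to the model.

    from rdflib import Graph, URIRef

    def build_prompt(graph: Graph, resource: URIRef, question: str) -> str:
        """Pull what the knowledge graph asserts about a resource and hand it
        to the LLM as grounding context, rather than relying on the model's
        training data alone."""
        facts = [f"{p} {o}" for _, p, o in graph.triples((resource, None, None))]
        context = "\n".join(facts)
        return f"Known facts about {resource}:\n{context}\n\nQuestion: {question}"

    # Any chat-completion API could consume the prompt produced here; the call
    # itself is omitted because the pattern, not the vendor, is the point.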
Again, the mission of data within the organizational scope is to create a better understanding or context not only of internal processes but also of external factors. I see the role of LLMs primarily as a vehicle for integration and transformation, making that kind of organizational intelligence possible.
One final note: I’ve heard occasionally from business leaders that they do not make decisions based on analytics but instead trust their guts. Ignoring for a moment the fact that one’s stomach and intestines do not have much processing power, what such leaders are saying is that they happen to have an instinctive understanding of what is going on in their organization, mainly because they had a significant hand in building that organization in the first place and have, in effect, absorbed a sense of analytics based upon contextual clues. In my experience, if you’re a non-founding manager, you probably have not had time to develop such a “gut” without many years of experience, and there is also likely a certain amount of survivorship bias built into the mix.
Put another way, while you can have too much information about the business and the world around you (meaning that you spend too much time and energy processing irrelevant data), most leaders are far from that point. The next central challenge for AI is figuring out how to determine what information is irrelevant, which is a considerably more difficult problem to solve.
In Media Res,
Kurt Cagle,
Editor, The Cagle Report