State of the Graph: Federation and Identity

This article is the fifth in the State of the Graph series on LinkedIn by Kurt Cagle. You can read the previous articles at the following links:

Where does your data live?

Until comparatively recently, this question was almost nonsensical: data lived in files. End of story. The files likely had different formats, because each programmer was essentially maintaining their own information, but on the flip side, the only person who had access to that data was the programmer, which meant that your data was safe so long as your programmer was trustworthy.

Networks had also been around since the 1960s: computers were expensive, and it was simply not efficient to have one programmer per computer. This meant that different people would have access to the same computer at the same time, usually by the computer temporarily swapping out one person's data and swapping in someone else's so that each had a slice of time on the CPU. When not in use, that data was persisted in magnetic core memory or on hard drives the size of industrial washing machines.

Databases are nothing but specialized files. The core information still exists in look-up tables, but those tables are in turn connected via indexes which identify where in those files certain information is contained. In the early 1970s, Edgar "Ted" Codd worked out a relational model for condensing this information for IBM, though the company would initially take a pass on implementing these ideas for fear that they would cannibalize sales of its existing data products. A decade later, Larry Ellison would take Codd's ideas and use them to establish Oracle, building out the SQL database as a standard that would come to dominate the IT world for decades to come.

SQL emerged at roughly the same time that networked PCs did, but in general the two worlds did not interact much. SQL worked because the table structure made it possible to create relationships through specialized indexes, but because referential integrity was such a critical factor, such relational data spaces were essentially localized to the information space within the database itself. This made it possible for more people to simultaneously access a given database (using rolling cursors and similar mechanisms through some kind of data record object), but the idea of actually transmitting that data to other databases was still something of a pipe dream, though there was an increasing push to find a way of doing so.

By the early 1990s, serialization had emerged as one of the big challenges of the period. A number of different binary serializations were proposed before XML emerged in the late 1990s (mostly by accident) as the first text-based data serialization to become mainstream. (CSV had been around forever, but it carries no notion of schema, relational keys, or much of anything beyond how to identify a simple table.)

Eventually the dichotomy of "data at rest, data in motion" took hold. In this dichotomy, "source data" was what existed within databases, while in-motion data was what was sent from databases to other databases or to application developers: a representation of that data that could be parsed and processed. This set up the great debate of the 2000s: structured data flows in which data was serialized in order to create document object models that could be slurped back into binary objects (Service Oriented Architecture, or SOA), versus more ad hoc approaches in which representations of web content were serialized, reconstructed as generalized objects, and then processed directly (REST).

In 2009, Anne Thomas Manes, now an analyst with Gartner, famously declared that "SOA Is Dead, Long Live Services". By then, the complexity involved in deserializing binary objects, transmitting them, then reserializing them outweighed the benefits in terms of security (which would eventually become a non-issue with HTTPS). JSON began overtaking XML around the same time, one reason being that the complexity of the SOAP/WSDL/UDDI stack had begun to tar XML with the same brush. Ironically, JSON in 2020 is beginning to approach the level of complexity of XML in 2010, but with less functionality.

During this same period, there was a growing sense of panic in the enterprise as the number of databases (and the degree to which those databases were siloed) reached a point where information was being lost inside all of those black boxes, most of which were still designed around the idea that data was self-contained and internally consistent (for every foreign key there existed a primary key). That is to say, the database had referential integrity.

Most database professionals do not question why referential integrity is so important. It's self-evident: without referential integrity, structure cannot be guaranteed. However, it also means that any data created in a different system (usually with its own keys) cannot be related to what's in your database, because at that point no referential integrity can exist. It is this point that makes integration so difficult.

It is worth understanding (something that many people who should know better don't) that computers do not automatically comprehend the distinction between a book and a person. In a relational database, the computer only knows that there is a key in a field in a table that needs to match up with the key that identifies the row of another table. That's it. What primarily differentiates a relational database from a semantic database is that the keys in the former are integers associated with a given schema table's identifiers, while a semantic database uses globally unique strings, and makes no assumption about whether a primary key with that identifier actually "exists" in any meaningful fashion.
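As a minimal sketch of that difference (the data and identifiers here are entirely hypothetical), the same local integer key can denote completely different things in two relational systems, while globally unique identifiers cannot collide:

```python
# Two relational "tables" from different systems, each with its own
# local integer keys. The key 1163 means a person in one system and
# a book in the other - the key alone carries no meaning.
crm_people = {1163: "Jane Doe"}
library_books = {1163: "Moby-Dick"}
assert crm_people[1163] != library_books[1163]

# A semantic store uses globally unique IRIs as keys instead
# (triples shown as plain tuples for illustration).
triples = {
    ("http://example.org/person/jane-doe", "rdfs:label", "Jane Doe"),
    ("http://example.org/book/moby-dick", "rdfs:label", "Moby-Dick"),
}
# Each IRI denotes exactly one entity, regardless of which system holds it.
subjects = {s for s, p, o in triples}
assert len(subjects) == 2
```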

This is the heart of the distinction between the closed-world assumption and its open-world counterpart, and it is also the reason why attempting to federate relational databases is doomed to failure.

Semantics does not imbue a computer with insight and human awareness. It is not a magical secret sauce. Instead, semantics is simply a promise: if I have a key, that key uniquely identifies something - a person, place, thing, machine, idea, animal or plant, intellectual work, a character in a novel, a product ... something. If you have this key, you are talking about this one thing. This is not to say that things cannot have multiple names, only that if two things have the same semantic key, they represent the same entity.

You can create a semantic database using a relational one - indeed, many of the early triple stores were built on SQL databases, though it's probably not the most efficient way of building them. What changes is that properties are no longer equivalent to column heads, but are instead treated as conceptual entities (they are, in essence, abstracted).
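A rough sketch of that idea (table and prefix names are hypothetical): a triple store over a relational database reduces everything to a single three-column table, so the property becomes an ordinary row value rather than a column head.

```python
# Minimal triple store over SQL: one (subject, predicate, object) table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:jane", "rdf:type", "ex:Patient"),
        ("ex:jane", "rdfs:label", "Jane Doe"),
        ("ex:jane", "ex:hasDoctor", "ex:smith"),
    ],
)
# The predicate is data we can query over - it has been abstracted
# from a column head into a first-class value.
rows = conn.execute(
    "SELECT object FROM triples "
    "WHERE subject = 'ex:jane' AND predicate = 'rdfs:label'"
).fetchall()
assert rows == [("Jane Doe",)]
```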

There are a host of implications to this. One of the first is that if you can otherwise disambiguate a resource, you can get its key. This is true of any database; what is not true of other databases is that when the same resource with the same key is located on two different systems, both records are still part of the same resource. It also means that if you have records from two different systems, merging them generally means simply copying the triples from one system to the other. Merging can get a little more complex, of course, especially when you have two records that represent an entity at different times. This in turn is what makes federation possible.
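The merge-by-copying point can be sketched (with hypothetical data) as a simple set union over triples, since shared identifiers line up automatically:

```python
# Records for the same resource from two different systems.
system_a = {
    ("ex:jane", "rdfs:label", "Jane Doe"),
    ("ex:jane", "ex:hasDoctor", "ex:smith"),
}
system_b = {
    ("ex:jane", "rdfs:label", "Jane Doe"),      # duplicate: collapses on union
    ("ex:jane", "ex:hasInsurer", "ex:acme"),
}
# Merging is just copying triples across; the shared IRI does the joining.
merged = system_a | system_b
assert len(merged) == 3  # the duplicate label triple appears only once
```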

Federation

Semantic databases emerged after the rise of the web and data services, when it was becoming evident that dealing with the creeping proliferation of relational databases could only end one way ... badly. RDF is, ultimately, not a data format. Rather, it is an abstraction of a data format. I can send you N-Triples, Turtle, TriG, JSON-LD, RDF/XML, RDF/JSON, RDFa, or half a dozen other formats (with more on the way), and they are nonetheless ALL simply representations of RDF. What's more, that serialization can be parsed by any triple store, and it will have the same available meaning.

The implications of this are huge. First, this is not really the case for relational data, even now. There is still no consistent serialization of relational data, and there likely never will be. Even expectations for JSON differ across vendors. XML was designed from the beginning to be fully agnostic, but the overall marketplace for XML databases is still small and likely shrinking. With RDF, data can be persisted as a file, in various triple stores, or as in-memory structures equally well, because what is being stored is an abstraction.

RDF can equally serialize tabular content and complex data structures in a lossless fashion. JSON and XML can do this as well, but without universal keys, neither can integrate with other similar structures outside of a database. RDF does this implicitly.

Finally, and to the point of this section, with RDF it becomes possible to store just enough information within an RDF database to create a useful stub (identity, type, description and label), along with an indication of which databases might hold additional content. For instance, in a health care network, one semantic hub may contain patient information, a second provider information, and a third diagnostic information. Each would also have the same schema information.
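A stub of this sort might look like nothing more than a handful of triples plus a pointer to the system holding the full record (a sketch; all names and the source:heldBy property are hypothetical):

```python
# A hub-side stub: identity, type, label, description, and a pointer
# to the endpoint that holds the full record.
stub = {
    ("ex:doctor-17", "rdf:type", "doctor:Doctor"),
    ("ex:doctor-17", "rdfs:label", "Dr. A. Smith"),
    ("ex:doctor-17", "dct:description", "Cardiologist, Mercy General"),
    ("ex:doctor-17", "source:heldBy", "https://providers.example.org/sparql"),
}
# The hub can answer identity questions locally and defer everything
# else to the endpoint named in the stub.
endpoints = {o for s, p, o in stub if p == "source:heldBy"}
assert endpoints == {"https://providers.example.org/sparql"}
```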

By associating the potential sources with the relevant classes in the schema, any query can federate the call to retrieve any nodes that happen to match elements not found in the native system. The individual entries wouldn't need to hold this information, because it can be gleaned from the schema based upon properties:

# rdfs: is the standard RDF Schema namespace; patient:, doctor:, and source:
# are illustrative namespaces that would need their own PREFIX declarations.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?patientLabel ?doctorLabel ?providerLabel WHERE {
   ?patient rdfs:label ?patientLabel .
   ?patient patient:hasDoctor ?doctor .
   patient:hasDoctor rdfs:range ?doctorClass .
   ?source source:hasTargetClass ?doctorClass .
   ?source source:hasEndpointURL ?sourceURL .
   SERVICE ?sourceURL {
      ?doctor rdfs:label ?doctorLabel .
      ?doctor doctor:worksFor ?provider .
      ?provider rdfs:label ?providerLabel .
   }
} ORDER BY ?patientLabel ?doctorLabel

The SERVICE clause in this case is what makes this a federated call - it tells the calling server to make a call to a SPARQL endpoint (?sourceURL) with the associated subquery, returning triples. It's worth noting that more than one service may be invoked from the query above, if the source in question has more than one source:hasTargetClass entry. Note that the actual mechanism of that call is application specific - as it should be. The query (or the querent) shouldn't need any knowledge of it.
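The schema-driven lookup behind that query can be sketched as follows (names and URLs are hypothetical): follow the property's rdfs:range to its class, then find every registered source advertising that class, yielding the endpoint URLs the engine would pass to SERVICE - including multiple endpoints when more than one source targets the class.

```python
# Schema triples: a property's range class, and which sources serve it.
schema = [
    ("patient:hasDoctor", "rdfs:range", "doctor:Doctor"),
    ("source:hubA", "source:hasTargetClass", "doctor:Doctor"),
    ("source:hubA", "source:hasEndpointURL", "https://providers-a.example.org/sparql"),
    ("source:hubB", "source:hasTargetClass", "doctor:Doctor"),
    ("source:hubB", "source:hasEndpointURL", "https://providers-b.example.org/sparql"),
]

def endpoints_for(prop):
    """Return every endpoint URL registered for the range class of prop."""
    ranges = {o for s, p, o in schema if s == prop and p == "rdfs:range"}
    sources = {s for s, p, o in schema
               if p == "source:hasTargetClass" and o in ranges}
    return sorted(o for s, p, o in schema
                  if s in sources and p == "source:hasEndpointURL")

# Two sources target doctor:Doctor, so two SERVICE calls would be issued.
urls = endpoints_for("patient:hasDoctor")
assert urls == ["https://providers-a.example.org/sparql",
                "https://providers-b.example.org/sparql"]
```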

This ability to federate also means that scaling issues (and bottlenecks) become far less of a problem. Databases have to scale in size because they need to keep information resident in order to query it. Federate that, and you replace one large, expensive data server with a limited number of ports with a number of smaller, cheaper servers that can be queried in a distributed fashion. You are also trading larger indexes (which reduce performance) for more databases with smaller indexes but higher latency. Since you have multiple systems, you also reduce the potential for bottlenecks, especially if you use load-balanced replications of a given database server.

Distributing your knowledge graphs in this way also deals with a common political problem within organizations. Most departments that deal with a certain subset of data want to keep that data close. HR isn't going to care about manufacturing data, but it will be very protective of personnel data. Letting each department curate its own data moves the subject matter experts (SMEs) closer, from a functional standpoint, to the information they know.

Federation by itself does not resolve the ontology mismatch problem, but it does make it easier for departments to specify those parts of an ontology that are more relevant to them. It also solves another potentially vexing problem: how do you provide information to an outside agency, company or individual without providing too much information?

In this case, you stand up a data hub that periodically makes calls to the other servers to update relevant content, but disable the ability to federate from that particular box for users who shouldn't have broader access. This is also a good place to summarize content - rather than exposing all present and former CEOs, you can simply create a new property, called, say, company:hasCEO, that contains only the latest one, applicable only in that one domain.
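A sketch of that summarization step (the company:hasCEO property and all data here are hypothetical): the hub derives a single current-officeholder triple from the full history rather than exposing the history itself.

```python
# Full CEO history held in the internal graph: (subject, predicate,
# object, start year) - the last element stands in for a proper date.
ceo_history = [
    ("ex:acme", "ex:hadCEO", "ex:smith", 2005),
    ("ex:acme", "ex:hadCEO", "ex:jones", 2012),
    ("ex:acme", "ex:hadCEO", "ex:patel", 2019),
]

# The outward-facing hub publishes only a derived summary triple.
latest = max(ceo_history, key=lambda t: t[3])
summary_triple = (latest[0], "company:hasCEO", latest[2])
assert summary_triple == ("ex:acme", "company:hasCEO", "ex:patel")
```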

Where Federation Is Now

The ability to federate is making its way into both knowledge graphs and property graphs. Neo4j recently released its 4.0 software, which includes (among many other features) support for both sharding and federation. Sharding involves distributing data within the same system, and is usually more efficient than federation for intradatabase operations. As for semantic databases, most now (within the last year or so) offer some form of federation and/or sharding, though not always consistently through SPARQL. It is likely that within the next couple of years, federation consistent with the W3C SPARQL federation standard will become de rigueur for SPARQL-centric databases.

Developing a best-practices approach for federation will likely take a couple of years as well, especially as you begin to see deep queries that span more than one hop in a federated network. Because RDF assertions are atomic, RDF naturally streams. However, because latency does become an issue, it is increasingly likely that SPARQL queries will become, perforce, naturally asynchronous as a consequence.

What about federation of non-RDF sources? This becomes a bit more problematic. It is certainly possible to set up an RDF endpoint on a relational database, making it look and act like a SPARQL system, but the underlying issues - the incompatibility of keys being the primary one - will always limit the utility of this approach. It is possible to retrofit such keys back into existing databases, but because applications also need to be made aware of the new keys, this becomes an issue with ramifications throughout the computing ecosystem. Put another way, as painful as it may be from an implementation standpoint, going greenfield may be the better solution, building out new semantic hubs as older relational systems become obsolete.

Federation, Identity and Master Data Management

An alternative strategy may be key seeding: establishing a mechanism for reading existing keys and contextual identifiers from the databases, then farming identifiers from those context clues and adding them into the semantic database. This is a form of Master Data Management, and it is at best forensic in nature, but until new applications are up and running, it can at least facilitate the process somewhat.

In essence, what this does is make your triple store into an identity name-server. In this case, when a new object is created, the creating information is sent to the semantic hub in order to create an associated URI for that record, with enough metadata to potentially identify it in other systems. Other information can be added by different systems, but the real value here is to aggregate keys where such resources have already penetrated other systems, tying those resources together. Knowing that a person has an ID of 1163 may not tell me anything, but knowing that this ID comes from one particular system makes it possible to uniquely correlate it with a person who has an ID of 192JD in a different system.
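The identity name-server idea can be sketched like so (system names, IDs, and the IRI are hypothetical): a local ID only means something in the context of its system, so the hub stores (system, local-id) pairs against a single IRI and resolves either pair to the same entity.

```python
# The hub maps one global IRI to the local keys it is known by
# in each contributing system.
hub = {
    "http://example.org/person/jane-doe": {
        ("crm", "1163"),
        ("ehr", "192JD"),
    },
}

def resolve(system, local_id):
    """Return the global IRI for a (system, local-id) pair, if known."""
    for iri, keys in hub.items():
        if (system, local_id) in keys:
            return iri
    return None

# Two unrelated-looking local IDs resolve to the same person.
assert resolve("crm", "1163") == resolve("ehr", "192JD")
```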

The challenge with this approach comes from both technical and political directions. On the technical side, such crosswalks can be notoriously difficult to keep in sync, and there will always be records that slip through the cracks. On the political side, such systems represent a performance overhead that can be costly, requiring changes either to code that is already extant and hardened, or to databases that are already seen as mission-critical (or both). This can be true of MDM systems in general - applying them after the fact is very much a "closing the barn door after the horse escapes" type of situation.

This is also an area that has a significant impact on smart contracts. One role that blockchain or distributed ledgers have is to create persistent identifiers that identify the participants and resources inherent within a contract. Chains of authority serve much the same purpose - they provide a level of surety that the person (or other resources) being presented is in fact the person that they say they are. Federation can store public keys against identifiers that can then be used in turn-key situations to not only assess identity, but then to allow access of specific information based not just on the person being queried, but the person doing the query.

This again feeds into the notion of the personal knowledge base (PKB). PKBs can be thought of as super-wallets: active graphs, potentially tied into distributed ledgers, that act as the repository for sensitive personal information as well as what could be thought of as accessible profiles. At the moment there's a major backlash against the often poor and egregious use that companies make of personal data, but there is also a modicum of convenience that comes from having that information available in the first place. By encapsulating this information in PKBs, external processes can query information for local activities (such as determining preference in clothing styles) while other information (such as voter preference) remains protected. This is a problem that is uniquely well suited to graph technology.

Summary

Federation is really only in its early stages at this point, but it has the potential to dramatically change how enterprise data is utilized. Because federation abstracts data services to queryable endpoints, it can significantly reduce the engineering involved in deploying applications, and can dramatically reduce the need to perform ETL within an organization.

It will also have the effect of moving data closer to those who will curate it. Technologies such as GraphQL can also take advantage of such schema-driven federation transparently, making such federated knowledge hubs appear to be relational or document driven based upon the needs of the developer or consumer, rather than the data provider.

Finally, I believe that federation is a necessary precondition for the implementation of smart contracts and personal data graphs, both topics I'll explore in more detail in the next article in this series.

Articles in Series:


Kurt Cagle is the CTO of Semantic Data Group, and the principal journalist for #theCagleReport. He can be reached at [email protected], or via LinkedIn.
