State of the Graph: JSON-LD, GraphQL and Graph Serialization
- This article is the fourth in the State of the Graph series on LinkedIn by Kurt Cagle.
Serialization and parsing have long been both the strongest and weakest aspects of graph technology. Relational database technology emerged in the 1970s and 1980s, when interoperability between database systems was a very low priority, and network technology only truly solidified in the late 1980s and early 90s around Ethernet on one hand and TCP/IP on the other. In general, you accessed databases via specialized bridges or connectors (such as ODBC and, later, JDBC, which emerged in the 1990s), and most serialization solutions stored this information in very specialized (and usually proprietary) binary formats.
The only real exception to this was the rise of the Comma-Separated Values (CSV) format and its delimiter-separated variants, in which the first row of a flat file contained a list of property names separated by a designated character. Each subsequent row contained the values for one record, each value associated with the corresponding column property, or an empty string if the data was null or otherwise empty.
One problem with CSVs was that there was no way to indicate that a value was specifically a reference to another table. You could serialize the index itself, but the only metadata identifying what that index referred to was the meaning of the associated column header. This lack of explicit references kept such serializations from being truly round-trip capable without human intervention, as it required the creation of awkward lookup tables to associate a property with an external reference.
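As a hypothetical illustration (the tables and column names here are invented), a person table might reference a city only through an opaque index column:

personId,fullName,cityId
p101,Jane Doe,c12

with the cities held in a separate file:

cityId,cityName
c12,New York City

Nothing in either file states that cityId is a reference into the second table; that knowledge lives entirely in the column header and in the reader's head.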
It wouldn't be until the late 1990s that the idea of using a standardized text format to represent data consistently took hold. The XML standard was first introduced in draft form in 1996 and established as a Recommendation by the fledgling W3C in 1998. The working group that created XML used HTML as its starting point, with its encapsulated syntax, but aimed at depicting the object models of things in general, rather than just HTML documents. Other working groups then complemented XML with a schema language (XSD) for encoding structural metadata, a path language (XPath) for encoding a linear path within a tree, and a common transformation language (XSLT) for mapping between two different information encodings as a form of ETL.
This led in turn to the first emergence of web services via an XML standard called SOAP (Simple Object Access Protocol), as well as to the elucidation of an idea that would prove seminal for data serialization: Representational State Transfer, better known as REST. The idea behind REST is subtle but profound: the web works because it provides a web address for retrieving conceptual entities. For the web, that entity was most likely a web page, an image, or a script.
When the request is made, however, what gets sent back is not in fact the thing, but a representation of that thing. The actual entity might be composed of information from multiple processes, external servers and at different resolutions, but to the person on the other end of that call, what gets sent back is a convenient representation of all of those things in a (relatively) self-contained package. This is actually one of the tenets of encapsulation - an object is seen as an independent entity that exposes its own interfaces without unnecessarily exposing its inner workings.
This was also one of the tenets that Tim Berners-Lee and the designers of the Semantic Web established: a resource can be represented by a unique identifier (an address in an information space), and within such an environment, it should be possible to get that resource encoded in any conceivable representation. He called this the Resource Description Framework (eventually known as RDF). In its simplest form, RDF is simply a collection of triples, each representing a subject, predicate, and object respectively.
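In the N-Triples serialization, for instance, a single assertion is written out as absolute IRIs (or a literal in the object position) followed by a period; the identifiers below anticipate the namespaces used in the listings that follow:

<https://myexample.com/ns/person#_JaneDoe> <https://myexample.com/ns/person#hasGender> <https://myexample.com/ns/gender#_Female> .
<https://myexample.com/ns/person#_JaneDoe> <https://myexample.com/ns/person#hasFullName> "Jane Doe" .

Each line is one complete statement; a graph is just a set of such lines.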
However, it could also be represented in other formats, most notably the Terse RDF Triple Language, usually called Turtle. This language usually has two sections: a prolog (more recently known as a context) that identifies and simplifies the namespaces involved, and the body, which contains the assertions themselves. An example might look something like the following (Listing 1):
# Listing 1. A person represented in Turtle
@prefix person: <https://myexample.com/ns/person#> .
@prefix class: <https://myexample.com/ns/class#> .
@prefix gender: <https://myexample.com/ns/gender#> .
@prefix address: <https://myexample.com/ns/address#> .
@prefix city: <https://myexample.com/ns/city#> .
@prefix state: <https://myexample.com/ns/state#> .
@prefix country: <https://myexample.com/ns/country#> .
@prefix xsd: <https://www.w3.org/2001/XMLSchema#> .

person:_JaneDoe a class:_Person ;
    person:hasFullName "Jane Doe"^^xsd:string ;
    person:hasFirstName "Jane"^^xsd:string ;
    person:hasLastName "Doe"^^xsd:string ;
    person:hasGender gender:_Female ;
    person:hasAddress [
        address:hasStreetAddress "123 Sesame St."^^xsd:string ;
        address:hasCity city:_NewYorkCity ;
        address:hasState state:_NewYork ;
        address:hasCountry country:_UnitedStates ;
        address:hasPostalCode "01223"^^xsd:string
    ] .
Turtle is important for a number of reasons. It was the basis for SPARQL, which has a similar (albeit not identical) format. It is the most commonly used format for semantic-based systems (there are few if any products that don't support it), and it is used heavily in semantic mapping, usually through tools generating text files. It has also been extended with TriG, which adds named-graph support to the language.
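As a minimal sketch of TriG (reusing the article's namespaces, with a hypothetical graph namespace), the Turtle assertions are simply wrapped in a named block that identifies the graph they belong to:

@prefix person: <https://myexample.com/ns/person#> .
@prefix class: <https://myexample.com/ns/class#> .
@prefix graph: <https://myexample.com/ns/graph#> .

graph:_People {
    person:_JaneDoe a class:_Person ;
        person:hasFullName "Jane Doe" .
}

This makes it possible to keep multiple graphs, each independently addressable, in a single document.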
One of the oldest serializations for RDF was, not surprisingly, based on XML. The RDF/XML serialization was an interesting hybrid in that it also incorporated a certain degree of OWL (the higher-level logic language that represents one of RDF's first "applications"). The following illustrates the same content expressed in that syntax:
<!-- Listing 2. RDF/XML representation of Jane Doe -->
<rdf:RDF
    xmlns:rdf="https://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:class="https://myexample.com/ns/class#"
    xmlns:person="https://myexample.com/ns/person#"
    xmlns:address="https://myexample.com/ns/address#"
    xmlns:city="https://myexample.com/ns/city#"
    xmlns:state="https://myexample.com/ns/state#"
    xmlns:country="https://myexample.com/ns/country#"
    xmlns:gender="https://myexample.com/ns/gender#">
  <rdf:Description rdf:nodeID="bnode14915214499793304332">
    <address:hasStreetAddress rdf:datatype="https://www.w3.org/2001/XMLSchema#string">123 Sesame St.</address:hasStreetAddress>
    <address:hasCity rdf:resource="https://myexample.com/ns/city#_NewYorkCity"/>
    <address:hasState rdf:resource="https://myexample.com/ns/state#_NewYork"/>
    <address:hasCountry rdf:resource="https://myexample.com/ns/country#_UnitedStates"/>
    <address:hasPostalCode rdf:datatype="https://www.w3.org/2001/XMLSchema#string">01223</address:hasPostalCode>
  </rdf:Description>
  <class:_Person rdf:about="https://myexample.com/ns/person#_JaneDoe">
    <person:hasFullName rdf:datatype="https://www.w3.org/2001/XMLSchema#string">Jane Doe</person:hasFullName>
    <person:hasFirstName rdf:datatype="https://www.w3.org/2001/XMLSchema#string">Jane</person:hasFirstName>
    <person:hasLastName rdf:datatype="https://www.w3.org/2001/XMLSchema#string">Doe</person:hasLastName>
    <person:hasGender rdf:resource="https://myexample.com/ns/gender#_Female"/>
    <person:hasAddress rdf:nodeID="bnode14915214499793304332"/>
  </class:_Person>
</rdf:RDF>
This format is rather mind-numbing, and while a lot of the oldest ontologies are still written this way, RDF/XML is fading quickly, in part because XML itself is becoming less and less used over time.
As an aside, I think that the judicious use of CURIEs (compact URIs) would go a long way towards simplifying this format. The above could have been written just as effectively like this:
<!-- Listing 3. RDF/XML simplified with CURIEs -->
<rdf:RDF
    xmlns:rdf="https://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="https://www.w3.org/2000/01/rdf-schema#"
    xmlns:rdfc="https://myexample.com/ns/rdfsCurieExtension#"
    xmlns:xsd="https://www.w3.org/2001/XMLSchema#"
    xmlns:class="https://myexample.com/ns/class#"
    xmlns:city="https://myexample.com/ns/city#"
    xmlns:state="https://myexample.com/ns/state#"
    xmlns:country="https://myexample.com/ns/country#"
    xmlns:gender="https://myexample.com/ns/gender#"
    xmlns:address="https://myexample.com/ns/address#"
    xmlns:person="https://myexample.com/ns/person#"
    xmlns="https://myexample.com/ns/person#"
    rdfc:hasDefaultScalarType="xsd:string">
  <class:_Person>
    <person:hasFullName>Jane Doe</person:hasFullName>
    <person:hasFirstName>Jane</person:hasFirstName>
    <person:hasLastName>Doe</person:hasLastName>
    <person:hasGender rdfc:resource="gender:_Female"/>
    <person:hasAddress>
      <address:streetAddress>123 Sesame St.</address:streetAddress>
      <address:city rdfc:resource="city:_NewYorkCity"/>
      <address:state rdfc:resource="state:_NewYork"/>
      <address:country rdfc:resource="country:_UnitedStates"/>
      <address:postalCode>01223</address:postalCode>
    </person:hasAddress>
  </class:_Person>
</rdf:RDF>
Other than the dependency upon namespaced CURIEs, this maps fairly closely to the Turtle code above. In cases where you have a large number of classes in a single ontology (such as schema.org), this can be simplified even further:
<!-- Listing 4. RDF/XML simplified, default namespace -->
<rdf:RDF
    xmlns:rdf="https://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="https://www.w3.org/2000/01/rdf-schema#"
    xmlns:rdfc="https://myexample.com/ns/rdfsCurieExtension#"
    xmlns:xsd="https://www.w3.org/2001/XMLSchema#"
    xmlns="https://schema.org#"
    rdfc:hasDefaultScalarType="xsd:string">
  <Person id="person:_JaneDoe">
    <hasFullName>Jane Doe</hasFullName>
    <hasFirstName>Jane</hasFirstName>
    <hasLastName>Doe</hasLastName>
    <hasGender rdfc:resource="gender:_Female"/>
    <hasAddress>
      <streetAddress>123 Sesame St.</streetAddress>
      <city rdfc:resource="city:_NewYorkCity"/>
      <state rdfc:resource="state:_NewYork"/>
      <country rdfc:resource="country:_UnitedStates"/>
      <postalCode>01223</postalCode>
    </hasAddress>
  </Person>
</rdf:RDF>
which is one of the few advantages to be gained by having everything in a single namespace, at least in my (somewhat minority) position. Regardless, there are advantages to working with XML, not least of which being that the semantics are generally well specified (and specifiable), but given the disdain that the JSON community has for the language, the arguments there are probably moot.
JSON was the brainchild of Douglas Crockford, who began promoting it in the early 2000s, arguing that XML was too verbose and syntactically different from the way that JavaScript and other procedural-language developers were used to working. Modeled on the JavaScript object-literal notation, JSON threw out most of the datatypes, namespaces, and other machinery of XML and went with the more familiar (to developers) syntax of curly braces and key:value pairs. XPath was replaced with the CSS selector language, XSD with JSON Schema, and ultimately XSLT with a plethora of transformation tools, none of which have become remotely standardized. Nonetheless, in the wild world of the web, this was considered progress.
The problem that both XML and JSON face when dealing with the semantic web is a surprisingly difficult one to surmount. Both evolved in a world where you were describing a single document, one in which everything was effectively encapsulated. However, in the world of linked data, what was being linked were conceptual entities, ones that may be distributed across dozens or even hundreds of documents.
SGML, XML's predecessor, used to describe the concept of forests, each of which contained many, ultimately linked, documents, but over time this particular notion disappeared - until the need finally arose again with XML (and then JSON) databases in the mid-2000s. The closed-world assumptions that had been carried over from the relational world suddenly faced the reality of the open-world assumption that RDF utilized, one that replaced the "where" of a resource with a "what if". Specifically: if I create a link to a resource, does that resource need to "exist" (i.e., be resolvable) at the time I make that link? Relational databases and XML say "yes", RDF says "no", and JSON has a rather weak "maybe".
This is a philosophical conundrum, and arguably most programmers are not, in general, philosophers. Yet the answer to this conundrum has significant repercussions for those programmers. RDF describes an information space by creating a graph of assertions, or statements, that are held to be true within the confines of that graph. JSON can exist as an open-ended array of objects. XML technically cannot, though in practice (and in some deep work on streaming) that's a limitation that is increasingly ignored.
The difference between a document and a stream is that a document perforce encapsulates its contents - you cannot resolve the meaning of the document until you have resolved the meaning of its underlying content to completion. A stream, on the other hand, is assumed to be open-ended. You never receive an "I'm done" message with a stream, only an "I'm done for the moment".
This stream orientation is now shaping the two biggest candidates for how semantics will play out on the web in particular: JSON-LD and GraphQL.
It's actually pretty easy to serialize RDF as JSON, if you're comfortable dealing with URLs. The simplest solution is again to break things down by subject, predicate, and object:
// Listing 5. Simple JSON listing.
[
  {
    "subject": "https://myexample.com/ns/person#_JaneDoe",
    "predicate": "https://myexample.com/ns/person#hasGender",
    "object": "https://myexample.com/ns/gender#_Female"
  },
  {
    "subject": "https://myexample.com/ns/person#_JaneDoe",
    "predicate": "https://myexample.com/ns/person#hasFullName",
    "object": {
      "value": "Jane Doe",
      "datatype": "https://www.w3.org/2001/XMLSchema#string"
    }
  }
]
However, this format is likely to send your average web developer into conniption fits, both because it's verbose and because it doesn't fit neatly into the JavaScript paradigm of object and property. The JSON-LD (JSON for Linked Data) format instead uses a context to define string patterns, with each pattern then feeding into subsequent patterns. This produces a far more compact document (Listing 6):
// Listing 6. JSON-LD format for Jane Doe (partial)
{
  "@context": {
    "Person": "https://myexample.com/ns/person#",
    "Gender": "https://myexample.com/ns/gender#",
    "xsd": "https://www.w3.org/2001/XMLSchema#",
    "janeDoe": {"@id": "Person:_JaneDoe", "@type": "Person"},
    "gender": "Person:hasGender",
    "fullName": "Person:hasFullName",
    "female": "Gender:_Female"
  },
  "@graph": {
    "janeDoe": {
      "gender": "female",
      "fullName": "Jane Doe"
    }
  }
}
The advantage of this format is that there's enough information within the format to be able to round-trip content back to a (JSON-LD aware) server while still being useful for the developer. You can essentially read the data using familiar dot notation:
// Listing 7. Accessing JSON-LD info
const fs = require("fs");
// Load the document from Listing 6 (local file used for illustration)
const data = JSON.parse(fs.readFileSync("JaneDoe.json", "utf8"));
const graph = data["@graph"];
const context = data["@context"];
Object.keys(graph).forEach((key) => {
  console.log(key, graph[key].fullName, graph[key].gender, context[key]["@type"]);
});
// => janeDoe Jane Doe female Person
In theory, JSON-LD generates JSON that should make programming simpler. In practice, its attempts to deal with many potential use cases, the existence of four different profiles, and the unpredictability of the kind of JSON being produced have led to some high-profile failures that I think are indicative of its overall future.
If that weren't enough, however, there's another technology that's coming into widespread use now: GraphQL. GraphQL started life at Facebook as a way to access information from the JSON-based Facebook social graph. It was released as open source, and while it is finding quite a bit of adoption with JSON databases, it is increasingly looking like semantic databases may end up being its biggest beneficiary.
The reason for this has to do with the fact that RDF almost always incorporates a way to provide schema information about the data structures that exist within a given knowledge graph. This can be provided with RDF Schema, with OWL, or with SHACL (covered in the next article), or can be derived from other schematic languages altogether. The key is that, regardless of the schematic underpinnings, it is possible to generate a corresponding schema template from a knowledge graph. This template can then be used with GraphiQL or similar clients to build queries, provide some kind of IntelliSense (API lookups), and generate JSON results. For instance, based upon such a template you could get a list of people that satisfy a given query (Listing 8):
# Listing 8. A GraphQL query
{
  person {
    id
    firstName
    lastName
    gender
    address {
      streetName
      city
      state
      zipCode: postalCode
    }
  }
}

==> Results

{
  "data": {
    "person": [
      {
        "id": "janeDoe",
        "firstName": "Jane",
        "lastName": "Doe",
        "gender": "female",
        "address": {
          "streetName": "123 Sesame St.",
          "city": "New York City",
          "state": "New York",
          "zipCode": "01223"
        }
      }
    ]
  }
}
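A rough sketch of the kind of generated schema template that might drive such a query follows; the type and field names here are assumptions mirroring the earlier listings, not output from any particular product:

# A hypothetical generated schema template (SDL)
type Address {
  streetName: String
  city: String
  state: String
  postalCode: String
}

type Person {
  id: ID!
  firstName: String
  lastName: String
  gender: String
  address: Address
}

type Query {
  person(id: ID): [Person]
}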
Three things make GraphQL powerful. First, the language completely abstracts away knowledge about the back-end data system - to the person using the GraphQL client or similar CLI, the source is simply a black box. Second, GraphQL takes a resource-oriented approach to data, which is very much in keeping with the philosophy behind semantic data.
Finally, it places no real demands on the developer to use a specialized interpreter or deal with awkward syntax in the JSON being requested. The advantage that this provides cannot be overstated. In effect, a developer can get the information that they need from the back-end database and map it into their own application structures, rather than battling with the back-end system to create data that meets their requirements.
GraphQL mutations work in a similar manner, transforming application data back into a canonical representation on the server. This requires a little more coordination, but once a bridge is established it means there's less of an issue with ontological dissonance, while at the same time keeping the enterprise data model intact.
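As a hedged sketch of what such a mutation might look like against the schema sketched earlier (the updatePerson field and its input shape are assumptions for illustration, not part of the GraphQL standard):

# A hypothetical mutation writing application data back to the server
mutation {
  updatePerson(id: "janeDoe", input: {gender: "female", lastName: "Doe-Smith"}) {
    id
    lastName
  }
}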
GraphQL has its limitations. It can handle absolute constraints well, but out of the box it doesn't do ranged queries, nor does it have the capacity to manage sorting. In other words, it's unlikely to significantly displace the role of SPARQL. GraphQL is a shim, a way of converting a GraphQL template query into some kind of SPARQL, and as such it is more complementary than competitive.
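To make the shim idea concrete, here is a sketch of the kind of SPARQL that the query in Listing 8 might devolve into; the exact translation depends entirely on the bridge in question:

PREFIX class: <https://myexample.com/ns/class#>
PREFIX person: <https://myexample.com/ns/person#>
PREFIX address: <https://myexample.com/ns/address#>

SELECT ?person ?firstName ?lastName ?gender ?streetAddress ?postalCode
WHERE {
  ?person a class:_Person ;
      person:hasFirstName ?firstName ;
      person:hasLastName ?lastName ;
      person:hasGender ?gender ;
      person:hasAddress ?addr .
  ?addr address:hasStreetAddress ?streetAddress ;
      address:hasPostalCode ?postalCode .
}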
What GraphQL does do is make it easier for non-semantically-trained developers to get the data they want as transparently as possible. This is a huge win, because the complexity of working with graph databases has been a limiting factor in the adoption of the technology.
Kurt Cagle is the CTO of Semantic Data Group and the editor of #theCagleReport.
Linked Data Nerd (Retired ... Mostly), 4 years ago: Note that JSON-LD and GraphQL actually address two different areas; JSON-LD is about data representation, while GraphQL is about querying (Turtle vs. SPARQL), and they are not completely disjoint (see GraphQL-LD: https://comunica.github.io/Article-ISWC2018-Demo-GraphQlLD/). The point of JSON-LD is to allow existing JSON representations to be interpreted as RDF by applying a context. In the latest version (JSON-LD 1.1), this can get to be quite complex, but in service of making an author or developer's life easier (at the expense of a publisher needing to do more work). This allows namespacing issues and datatypes to be inferred on the underlying JSON representation, which, as you note, has just a simple set of types (strings, numbers, booleans, arrays, and objects). GraphQL is emerging as a very important mechanism to query data sources, which may be effectively document stores holding JSON documents that are queried by example, or devolved into an underlying basic graph pattern that can be applied to SPARQL endpoints, yielding the results as JSON, which, of course, could be treated as JSON-LD.
Excellent article as usual. You made an intriguing remark: "which is one of the few advantages to be gained by having everything in a single namespace, at least in my (somewhat minority) position". Minority position, indeed, but I can imagine enough benefits to be interested in the full list. Can you please compare the two approaches? If you have done it already, I'd appreciate a pointer to that resource.
Founder and CEO at Pickathon, 5 years ago: I fully agree that GraphQL is going to be a core technology in how graph data finally transforms enterprise businesses (it's already happening for businesses that are natively built on web services). Some of the current limitations of GraphQL in terms of adding useful front-end business logic (filtering, sorting, functions, etc.) to queries are left open to the development of the specific GraphQL interface, but it can be done. Once enterprise ontologies are standardized based on end use cases (and made open, as this is the only way they will get traction), this is where the real explosion of efficiency and federation between back-end and front-end web services will have transformative impacts on existing enterprise businesses that are even hard to imagine. I'm excited to be part of this transformation for the semiconductor industry, and I can say with confidence that we are only at the beginning stages.