Why SQL Is Not Right for Querying JSON

Recently, creators of JSON databases have been grappling with a fundamental problem. Simply storing and retrieving a given document by a specific key, while useful in any number of scenarios, is not enough when people want to query the JSON content itself. This is not really surprising - once you start gathering data into a database, one of the key requirements that emerges is the ability to query that data based upon specific properties.

Early on, the solution to this dilemma was to build indexes - specific look-up tables for certain key-value pairs. By keeping the indexes ordered, you could also compare values for these keys, allowing you to select within a range or to quickly order documents by their respective sort keys.
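
To make this concrete, here is a minimal sketch (in JavaScript, with invented sample data) of the kind of look-up table such an index amounts to:

// A toy index over a set of JSON documents: for a single key ("surname"),
// map each value to the ids of the documents that contain it.
var docs = {
    d1: {givenName: "Jane", surname: "Doe"},
    d2: {givenName: "John", surname: "Smith"},
    d3: {givenName: "Janet", surname: "Doe"}
};
var surnameIndex = {};
Object.keys(docs).forEach(function(id) {
    var value = docs[id].surname;
    (surnameIndex[value] = surnameIndex[value] || []).push(id);
});
// surnameIndex is now {Doe: ["d1", "d3"], Smith: ["d2"]}: fetching every
// document whose surname is "Doe" becomes a single look-up, and keeping the
// keys sorted is what makes range scans and ordered retrieval cheap.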

The problem with this approach is that while it is a useful way to retrieve documents that contain those keys, it does nothing for actually retrieving specific fields, and it gets really complicated when you have multiple keys that exist within specific sub-trees of a given document.

In the XML space, there was, early on, a mechanism for retrieving a specific path from a document using a language called (not surprisingly) XPath. XPath has become the foundation for most XML technologies - XSLT and XQuery are both built on XPath expressions, as is the Schematron schema language - and the XML Schema language itself is built upon the XML document model and implicitly uses XPath as its foundation, even though it works with this information in a typically decomposed way.

The JSONiq specification was proposed by a consortium of companies as a way of extending XPath so that it could also be applied to JSON structures. With JSONiq, you could write the expression

(
    for $givenName in /individual/name/givenName
    order by $givenName
    return $givenName
)[position() = (1 to 20)]

to retrieve the given names of the first twenty individuals in the database, ordered alphabetically. As an addressing mechanism this solves a number of problems, but it also faces others, not least of which is that unless you can radically improve your index look-ups, the performance of querying in this manner can be poor - not something you want to defend while trying to get a leg up on the competition.

Lately, there has been a spate of ostensibly NoSQL databases all rediscovering SQL as a way to get around these limitations. SQL is commonly known, of course, and this is typically cited as one reason for going to SQL as a query mechanism, especially when used to read specific properties in order to construct other JSON structures. 

I intend to make the case here, however, that moving back to SQL may end up biting these vendors in the butt sooner rather than later. One of the big reasons for this comes down to a very fundamental difference between hierarchical and relational (or, more relevantly, denormalized vs normalized) data stores.

In a relational system, everything is broken down into tables, and these tables are then held together by foreign keys pointing to primary keys. This process of building systems with reference keys and tables was revolutionary for its time (mid-70s to mid-80s) and in many respects served a very real need - getting good performance out of queries that the hierarchical and navigational databases which predated the relational model simply couldn't handle.

In the simplest form of normalization (call it the zeroth form), all of the information about a given entity can be given as a single table with multiple columns and rows. A person, for instance, could be represented with a given and surname, a street, city, state and zipcode (assuming a US-centric representation), a web address, a phone number and so forth.
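
A sketch of such a "zeroth form" record, using a JavaScript object as the stand-in for a table row (the data is invented):

// One flat row per person - every attribute is a column.
var person = {
    givenName: "Jane", surname: "Doe",
    street: "123 Sesame Street", city: "New York", state: "NY", zip: "10001",
    web: "http://example.com/jane", phone: "555-0123"
};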

Of course, reality tends to muck up such simple models. Assume for the moment that the person in question is a student, and things get more complicated - is the physical address information their location on campus, where they receive their mail, or where their parents live? You can, of course, extend the names of your columns to represent differing types, school_street vs home_street as an example, but this has the dual effect of adding to the number of columns and making the table sparser, since much of this information is not relevant for most people. The more class variation you have, the more this problem is exacerbated.

First normal form (also called compositional form, after its UML name) works by associating a key (typically a unique index or number) with each row: each row of the independent entity ("Person") is given a primary key, and each row of the dependent entity ("Address") is given a foreign key field that points back to the associated primary key. The Address table rows don't need to have their own primary keys, though there are in fact several benefits to having them (the big one being useful if the address itself has a dependent entity).
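
As a rough illustration, again using plain JavaScript arrays as stand-ins for tables:

// First normal form: Person rows carry primary keys, and each Address row
// carries a foreign key ("fk") pointing back at its owning Person.
var persons = [
    {id: "jd101", givenName: "Jane", surname: "Doe"}
];
var addresses = [
    {id: "addr1001", fk: "jd101", street: "123 Sesame Street",
     city: "New York", type: "Home Address"},
    {id: "addr1002", fk: "jd101", street: "1313 Mockingbird Ln",
     city: "Arkham", type: "Dorm Address"}
];
// A join is just a filter on the foreign key:
var janesAddresses = addresses.filter(function(a) { return a.fk === "jd101"; });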

First normal forms usually manifest as container/contained relationships - because a person may have more than one address, it is natural to think of the address as nested within the person entry. Both XML and JSON consequently make use of that nested first normal form, so that you may have an XML structure such as:

<table_Person id="jd101">
    <givenName>Jane</givenName>
    <surname>Doe</surname>
    <gender>F</gender>
    <table_Address id="addr1001" fk="jd101">
        <street>123 Sesame Street</street>
        <city>New York</city>
        <type>Home Address</type>
    </table_Address>
    <table_Address id="addr1002" fk="jd101">
        <street>1313 Mockingbird Ln</street>
        <city>Arkham</city>
        <type>Dorm Address</type>
    </table_Address>
</table_Person>

The use of the "table_" notation here is intended to highlight the relationship between a table row (with the corresponding primary key given in the "id" attribute) and the associated foreign key attributes ("fk"). In reality, compositional entries almost never contain a foreign key reference, because it is implicit in being contained by the parent object, and they surprisingly seldom contain a primary key either, primarily because most people do not think about round-tripping or updating such content. This can be an expensive error to make when your XML or JSON information comes from a relational source.

Things get even more complicated when you move to second normal form. This is the classic case where you have many-to-many relationships. For instance, consider the case where you have students, teachers and classes. In a given school, every class will have one or more teachers, a teacher may teach more than one class, and students may be in multiple classes. Relationally, this is resolved by giving each of these entities a primary key, then adding two associative entities tying teachers to classes and students to classes. Even as objects in XML, this is handled by decomposition:

<Schedule>
    <Class id="hw101">
        <title>Potions</title>
    </Class>
    <Class id="hw102">
        <title>Enchantments</title>
    </Class>
    <Class id="hw103">
        <title>Defense Against the Dark Arts</title>
    </Class>
    <Teacher id="mm">
        <name>Minerva McGonagle</name>
    </Teacher>
    <Teacher id="ss">
        <name>Severus Snape</name>
    </Teacher>
    <Teacher id="ad">
        <name>Albus Dumbledore</name>
    </Teacher>
    <Student id="hp">
        <name>Harry Potter</name>
    </Student>
    <Student id="rw">
        <name>Ron Weasley</name>
    </Student>
    <Student id="hg">
        <name>Hermione Granger</name>
    </Student>
    <StudentClass id="sc1">
        <Student fk="hp"/>
        <Class fk="hw101"/>
    </StudentClass>
    <StudentClass id="sc2">
        <Student fk="hp"/>
        <Class fk="hw102"/>
    </StudentClass>
    <StudentClass id="sc3">
        <Student fk="rw"/>
        <Class fk="hw101"/>
    </StudentClass>
    <StudentClass id="sc4">
        <Student fk="rw"/>
        <Class fk="hw103"/>
    </StudentClass>
    <StudentClass id="sc5">
        <Student fk="rw"/>
        <Class fk="hw102"/>
    </StudentClass>
    <StudentClass id="sc6">
        <Student fk="hg"/>
        <Class fk="hw101"/>
    </StudentClass>
    <StudentClass id="sc7">
        <Student fk="hg"/>
        <Class fk="hw102"/>
    </StudentClass>
    <StudentClass id="sc8">
        <Student fk="hg"/>
        <Class fk="hw103"/>
    </StudentClass>
    <TeacherClass id="tc1">
        <Teacher fk="ss"/>
        <Class fk="hw101"/>
    </TeacherClass>
    <TeacherClass id="tc2">
        <Teacher fk="mm"/>
        <Class fk="hw102"/>
    </TeacherClass>
    <TeacherClass id="tc3">
        <Teacher fk="ss"/>
        <Class fk="hw103"/>
    </TeacherClass>
    <TeacherClass id="tc4">
        <Teacher fk="ad"/>
        <Class fk="hw103"/>
    </TeacherClass>
</Schedule>

Now, while this is in XML (JSON would be similar), it's also worth noting that it is in a completely normalized form. There are clearly five tables - Class, Teacher, Student, StudentClass and TeacherClass - with the only thing not normalized being the enclosing (and largely meaningless) <Schedule> element. You could (and should) treat these as separate entities in their own right and lose the containing element altogether, as sketched below.
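
In JSON terms, that decomposition might look like the following sketch (only three of the five collections are shown - students and their enrollments follow the same pattern):

// Separate, fully normalized collections - no containing element at all.
var classes = [
    {id: "hw101", title: "Potions"},
    {id: "hw102", title: "Enchantments"},
    {id: "hw103", title: "Defense Against the Dark Arts"}
];
var teachers = [
    {id: "mm", name: "Minerva McGonagle"},
    {id: "ss", name: "Severus Snape"},
    {id: "ad", name: "Albus Dumbledore"}
];
var teacherClasses = [
    {teacher: "ss", class: "hw101"}, {teacher: "mm", class: "hw102"},
    {teacher: "ss", class: "hw103"}, {teacher: "ad", class: "hw103"}
];
// Resolving a relationship is a walk across the keys:
var mmTitles = teacherClasses
    .filter(function(tc) { return tc.teacher === "mm"; })
    .map(function(tc) {
        return classes.filter(function(c) { return c.id === tc.class; })[0].title;
    }); // ["Enchantments"]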

Of course, this negates one of the big advantages of JSON stores (and to a lesser extent, XML data stores) - the ability to retrieve structured content quickly. With XML this isn't as major a factor, because most XML content is narrative in structure - if you were to walk the XML tree in a certain order, taking only the text content, the output would usually be a comprehensible string of text. XML also allows for annotative metadata that JSON generally doesn't. Lose that narrative requirement, and it makes far more sense to normalize these structures down to their base entities.

Now, you can use SQL for this, but SQL has a number of strikes against it:

  • Most SQL solutions do not in fact retain any model information - particularly type - and cardinality has to be inferred (something difficult to do if you don't explicitly know the structure).
  • Foreign key references usually depend on specific indexing, which adds to the overall memory and space utilization.
  • SQL by itself doesn't have the notion of an iterator; rather, it generally relies upon a recordset interface. This means that while the outward form may be SQL-like, it's non-standard regardless (see the sketch after this list).
  • SQL was never designed for referencing across the Internet (as it predates it by nearly ten years), nor does it have a specific serialization. 
  • SQL has no explicit extension capability - it is possible to write stored procedures, but these make for very non-portable implementations.
  • Most SQL solutions for JSON in particular do not incorporate a DDL. 
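
On the recordset point in particular, a SQL-style result is a flat sequence of rows in which parent fields repeat, so reassembling the hierarchy you actually wanted is left entirely to the caller. A minimal sketch in JavaScript (the rows are invented):

// What a flat recordset for "teacher, class, student" looks like:
var rows = [
    {teacher: "Minerva McGonagle", cls: "Enchantments", student: "Harry Potter"},
    {teacher: "Minerva McGonagle", cls: "Enchantments", student: "Hermione Granger"}
];
// The nesting has to be rebuilt by hand on the client side:
var nested = {};
rows.forEach(function(row) {
    nested[row.teacher] = nested[row.teacher] || {};
    nested[row.teacher][row.cls] = nested[row.teacher][row.cls] || [];
    nested[row.teacher][row.cls].push(row.student);
});
// {"Minerva McGonagle": {"Enchantments": ["Harry Potter", "Hermione Granger"]}}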

Given all of these factors, what would provide a better solution? In general, going with an XQuery/JSONiq solution for construction makes a great deal of sense - it is in fact designed to do the same set-type operations that SQL can do, but it was designed some twenty years later, at a time when modular design, the accessibility of the Internet and multiple serialization forms were well established rather than largely theoretical.

In addition to this, SPARQL can be used in conjunction with XQuery/JSONiq. SPARQL is SQL plus twenty years, and was designed in part by the same people who had developed SQL two decades before. For instance, with a few minor changes, the schedule "database" could readily be queried via SPARQL. If you wanted a listing of both the classes that a given teacher teaches and the students in each class, you could do something like the following (sparql:eval here stands in for whatever vendor function evaluates a SPARQL string against the store with a map of external bindings - the name is illustrative, not standard):

let $teacher-name := "Minerva McGonagle"
return
<schedule>
    <teacher>{$teacher-name}</teacher>
    {
    let $class-maps := sparql:eval('
        select ?class ?title where {
            ?teacher Teacher:name ?teacherName .
            ?tc TeacherClass:teacher ?teacher .
            ?tc TeacherClass:class ?class .
            ?class Class:title ?title }
        order by ?title',
        map:entry("teacherName", $teacher-name))
    for $class-map in $class-maps
    let $class := map:get($class-map, "class")
    return
        <class>
            <title>{map:get($class-map, "title")}</title>
            {
            let $student-maps := sparql:eval('
                select ?name where {
                    ?sc StudentClass:class ?class .
                    ?sc StudentClass:student ?student .
                    ?student Student:name ?name }
                order by ?name',
                map:entry("class", $class))
            for $student-map in $student-maps
            return <student>{map:get($student-map, "name")}</student>
            }
        </class>
    }
</schedule>

This would then generate the XML report:

<schedule>
    <teacher>Minerva McGonagle</teacher>
    <class>
        <title>Enchantments</title>
        <student>Harry Potter</student>
        <student>Hermione Granger</student>
        <student>Ron Weasley</student>
    </class>
</schedule>

Similarly, you could use very nearly the same code working in JavaScript and JSON (again with sparql.eval standing in for the store's query-evaluation call):

var teacherName = "Minerva McGonagle";
var classMaps = sparql.eval(
    'select ?class ?title where { ' +
    '    ?teacher Teacher:name ?teacherName . ' +
    '    ?tc TeacherClass:teacher ?teacher . ' +
    '    ?tc TeacherClass:class ?class . ' +
    '    ?class Class:title ?title } ' +
    'order by ?title', {teacherName: teacherName});
var classes = classMaps.map(function(classMap) {
    var title = classMap.title;
    var classId = classMap.class;
    var studentMaps = sparql.eval(
        'select ?name where { ' +
        '    ?sc StudentClass:class ?class . ' +
        '    ?sc StudentClass:student ?student . ' +
        '    ?student Student:name ?name } ' +
        'order by ?name', {class: classId});
    var students = studentMaps.map(function(studentMap) {
        return studentMap.name;
    });
    return {class: {title: title, students: students}};
});
return {schedule: {teacher: teacherName, classes: classes}};

The output of this would be even terser:
{schedule: {
    teacher: "Minerva McGonagle",
    classes: [
        {class: {
            title: "Enchantments",
            students: ["Harry Potter", "Hermione Granger", "Ron Weasley"]
        }}
    ]
}}

By the way, you can also use the rdfxml and rdfjson notations. The rdfxml notation can in fact look remarkably similar to the XML format described above:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="urn:hp#" xml:base="urn:hp#">
    <Teacher rdf:about="#mm">
        <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Minerva McGonagle</name>
    </Teacher>
    <Teacher rdf:about="#ss">
        <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Severus Snape</name>
    </Teacher>
    <Teacher rdf:about="#ad">
        <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Albus Dumbledore</name>
    </Teacher>
    <Student rdf:about="#hp">
        <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Harry Potter</name>
    </Student>
    <Student rdf:about="#rw">
        <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Ron Weasley</name>
    </Student>
    <Student rdf:about="#hg">
        <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Hermione Granger</name>
    </Student>
    <Class rdf:about="#hw101">
        <title rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Potions</title>
    </Class>
    <Class rdf:about="#hw102">
        <title rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Enchantments</title>
    </Class>
    <Class rdf:about="#hw103">
        <title rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Defense Against the Dark Arts</title>
    </Class>
    <ClassTeacher rdf:about="#ct1">
        <class rdf:resource="#hw101"/>
        <teacher rdf:resource="#ss"/>
    </ClassTeacher>
    <ClassTeacher rdf:about="#ct2">
        <class rdf:resource="#hw102"/>
        <teacher rdf:resource="#mm"/>
    </ClassTeacher>
    <ClassTeacher rdf:about="#ct3">
        <class rdf:resource="#hw103"/>
        <teacher rdf:resource="#ss"/>
    </ClassTeacher>
    <ClassTeacher rdf:about="#ct4">
        <class rdf:resource="#hw103"/>
        <teacher rdf:resource="#ad"/>
    </ClassTeacher>
    <ClassStudent rdf:about="#cs1">
        <class rdf:resource="#hw101"/>
        <student rdf:resource="#hp"/>
    </ClassStudent>
    <ClassStudent rdf:about="#cs2">
        <class rdf:resource="#hw101"/>
        <student rdf:resource="#hg"/>
    </ClassStudent>
    <ClassStudent rdf:about="#cs3">
        <class rdf:resource="#hw102"/>
        <student rdf:resource="#hp"/>
    </ClassStudent>
    <ClassStudent rdf:about="#cs4">
        <class rdf:resource="#hw102"/>
        <student rdf:resource="#rw"/>
    </ClassStudent>
    <ClassStudent rdf:about="#cs5">
        <class rdf:resource="#hw103"/>
        <student rdf:resource="#rw"/>
    </ClassStudent>
    <ClassStudent rdf:about="#cs6">
        <class rdf:resource="#hw103"/>
        <student rdf:resource="#hg"/>
    </ClassStudent>
</rdf:RDF>

The advantage to this approach is that it is readily parseable by most contemporary triple stores. The rdfjson format is not quite as standard, but is one that can be utilized in JSON stores such as MarkLogic. A sketch in the W3C RDF/JSON layout (each subject keys a map of predicates, and each predicate an array of value objects), covering just the first teacher and one join entry, looks like this:
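
{
    "urn:hp#mm": {
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            {"type": "uri", "value": "urn:hp#Teacher"}
        ],
        "urn:hp#name": [
            {"type": "literal", "value": "Minerva McGonagle",
             "datatype": "http://www.w3.org/2001/XMLSchema#string"}
        ]
    },
    "urn:hp#ct2": {
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
            {"type": "uri", "value": "urn:hp#ClassTeacher"}
        ],
        "urn:hp#class": [{"type": "uri", "value": "urn:hp#hw102"}],
        "urn:hp#teacher": [{"type": "uri", "value": "urn:hp#mm"}]
    }
}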

Finally, the same data can be rendered in the RDF Turtle language, which showcases how terse and compact Turtle is compared to the other formats:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix hp: <urn:hp#> .

hp:mm rdf:type hp:Teacher ;
    hp:name "Minerva McGonagle" .

hp:hw103 hp:title "Defense Against the Dark Arts" ;
    rdf:type hp:Class .

hp:ad rdf:type hp:Teacher ;
    hp:name "Albus Dumbledore" .

hp:ct1 rdf:type hp:ClassTeacher ;
    hp:class hp:hw101 ;
    hp:teacher hp:ss .

hp:cs6 rdf:type hp:ClassStudent ;
    hp:class hp:hw103 ;
    hp:student hp:hg .

hp:hg rdf:type hp:Student ;
    hp:name "Hermione Granger" .

hp:cs2 rdf:type hp:ClassStudent ;
    hp:class hp:hw101 ;
    hp:student hp:hg .

hp:cs1 rdf:type hp:ClassStudent ;
    hp:class hp:hw101 ;
    hp:student hp:hp .

hp:hp rdf:type hp:Student ;
    hp:name "Harry Potter" .

hp:rw rdf:type hp:Student ;
    hp:name "Ron Weasley" .

hp:cs5 rdf:type hp:ClassStudent ;
    hp:class hp:hw103 ;
    hp:student hp:rw .

hp:cs3 rdf:type hp:ClassStudent ;
    hp:class hp:hw102 ;
    hp:student hp:hp .

hp:ct3 rdf:type hp:ClassTeacher ;
    hp:class hp:hw103 ;
    hp:teacher hp:ss .

hp:ct2 rdf:type hp:ClassTeacher ;
    hp:class hp:hw102 ;
    hp:teacher hp:mm .

hp:ss rdf:type hp:Teacher ;
    hp:name "Severus Snape" .

hp:hw101 hp:title "Potions" ;
    rdf:type hp:Class .

hp:ct4 rdf:type hp:ClassTeacher ;
    hp:class hp:hw103 ;
    hp:teacher hp:ad .

hp:cs4 rdf:type hp:ClassStudent ;
    hp:class hp:hw102 ;
    hp:student hp:rw .

hp:hw102 hp:title "Enchantments" ;
    rdf:type hp:Class .

Summary

  • Working with normalized content in both XML and JSON databases provides greater flexibility in terms of output structures, searching and semantic interrelationships than keeping larger denormalized structures (though if you have the space, there's nothing stopping you from keeping both).
  • As we increasingly move towards using XML or JSON document stores for pure data storage (rather than narrative data), decomposed normal structures have more value in a number of ways, and can always be denormalized (transformed) prior to output, depending on what kind of structure you need externally.
  • RDF - as XML or JSON - is normalized data. Not only can it be queried as such in XML or JSON data stores, but it can also be parsed as RDF and utilized by a triple store. 
  • Identifiers matter, especially when the information you are modeling moves beyond the first normal form. 
  • Neither XML nor JSON by itself is necessarily good for joining across documents. XML has a little-used xlink: facility, but it is mostly ineffective for semantic linking; JSON doesn't even have that. However, by mixing XML or JSON with RDF (either through an internal store or via a service to an external triple store such as Jena) you get the benefits of SPARQL and the nearly thirty years of advances that have occurred since the advent of SQL.
  • Finally, SQL is not going to evolve significantly at this stage. JSONiq and XQuery are still malleable, and SPARQL - which was designed in part with evolution across formats in mind - has barely even cooled yet. Standardization efforts are becoming inevitable - the NoSQL market has exploded of late, and interoperability (and the worry about vendor lock-in) are becoming factors in enterprise buying decisions. In many respects, querying NoSQL is now at the stage relational databases were back in the early 1980s, when people were beginning to realize the need to standardize on something like SQL in the first place.

Kurt Cagle is the founder and chief ontologist for Semantical, LLC.

Comments

Fabian Pascal

Editor & Publisher, DATABASE DEBUNKINGS; Data and Relational Fundamentalist; Consultant, Analyst, Author, Educator, Speaker

9 years ago

Yes, but use cases serve only as evidence that a theoretical solution is sound and sufficiently general.

Christoph Bussler

VP Research and Development || Building high-performing teams || Creating exceptional software products || Empowering enthusiastic customers

9 years ago

Agree with Daniela. Use cases (in a way that covers the required spectrum) would be the starting point. SQL as we know it does not mean that this is all that can be done. There is NF2 work (done actually a while back), there are O/R extensions of SQL, etc. SQL is a syntax, not a semantics; given the semantics to be achieved, the syntax can take many forms, including an SQL-style grammar. So SQL could actually be the right query language in the above sense, also given that you want to make everything accessible to the infrastructure and knowledge around it.

Kurt, what is "good" (or not) starts with the expectations, or simply put, use cases. I think XQuery did a better job because there was a clear goal about what we wanted to achieve and about the expectations of what this new query language should do. In the JSON world, I think vendors (and users too) are confused about what they want their respective languages to do. After you figure that out, the rest can be done... including a good language design AND achieving good performance.

Kurt Cagle

Editor In Chief @ The Cagle Report | AI, Data Modeling

9 years ago

Hi, Frank! It made sense at the time. I'm reworking the latter part, because the article did definitely go off on a rather bizarre tangent. I think one of the things I'm trying to work out in my own head is the extent to which you CAN query denormalized content (XML, JSON) with a normal-form language (SQL certainly, SPARQL to a lesser extent), both from a performance and from a syntactic standpoint. This is actually a fairly serious problem now with one particular client (a big Minnesota-based health insurance company that you're likely familiar with), in that the data ingest comes from normalized content, but there was a design philosophy (one I think we're moving away from, thankfully) that said we should be building deeply nested content. When dealing with narrative content, this makes sense, but when your progenitor content is much more heavily relational, decomposing to minimally functional pieces makes more sense. Anyway, thanks for the response.

Frank Rubino

Associate Partner @ IBM | Consulting on Data, AI and ML

9 years ago

Kurt - the first part of this is great, and like another commenter I thought you might be heading in a slightly different direction; I get the late-night meandering though - it's like you were thinking out loud, and I don't mind that at all. From a large-scale architecture point of view, it is difficult to provide performant results on semi-structured content like JSON and XML if the underlying model is tabular. Most solutions achieve polymorphic indexing by synchronizing the tables to other indexes. The overhead to the architecture is in making updates work in concert (so there's disaster recovery if all the updates have to be played back after a failure), and there's more overhead, of course, for synthesizing all the query results from the different components. This is exacerbated in distributed systems that have to provide reasonable SLAs (never mind unreasonable ones!). So when you actually dive into the level of effort of using a familiar tool (SQL) to do a job for which it is unsuited (querying JSON) vs. using the right data models and a variety of query languages (XQuery + SPARQL), you have yet another compelling argument for the latter.


要查看或添加评论,请登录

社区洞察

其他会员也浏览了