Why SPARQL Is Poised To Set the World on Fire

In a recent LinkedIn thread, Paul Houle made the comment that SPARQL has not yet set the world on fire. I agree with his statement, but I also think that there are some very good reasons why it hasn't ... and that many of those reasons are fading away.

Back in 2007, I had the opportunity to go to the IW3C2 conference in Chiba, Japan. This conference is held every year in a different part of the world, and is an opportunity for members of the W3C to get together in confabs and meetings to discuss the development of web standards. I had the chance at the time to talk briefly with Tim Berners-Lee (TBL), but perhaps one of the more significant experiences for me was running into the developers working on the SPARQL standard. The language was intriguing, especially as I'd just begun to work with RDF on a regular basis.

RDF, the Resource Description Framework, is by itself simply a data format - a way of describing information that includes both atomic data (strings, numbers, dates and so forth) and links between data structures. At the time, RDF used a rather ungainly XML structure to describe this information, but Dave Beckett and TBL had also put together a proposal for another language, the Terse RDF Triple Language (Turtle), that provided a more concise notation. A few additional notations, such as the Manchester syntax for OWL, have come about since then that are more functional in nature, but Turtle very quickly caught on as a way to concisely describe and incorporate assertions.

Additionally, a schema language of sorts (RDF Schema) had appeared earlier, though this was later superseded by the more comprehensive Web Ontology Language (OWL), with a significant revision, OWL 2, following in its wake in late 2009. These languages were useful for establishing modeling constraints on assertions - identifying the characteristics of classes, properties and similar relationships - but actually querying RDF was still remarkably difficult.

This is why SPARQL was so important. Without SPARQL, every triple store had to invent its own language for querying information, and this often involved complex and rather ungainly Java calls. SPARQL built upon the basic structure of Turtle, replacing specific subject, predicate or object parts with variables that could then be strung together to create inner joins and outer joins, along with joins on the relationship predicates themselves.
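To make that concrete, here is a minimal sketch of such a query (the employee: vocabulary and prefix IRI are hypothetical, invented for illustration): two triple patterns sharing a variable act as an inner join, while OPTIONAL yields an outer join.

    PREFIX employee: <http://example.org/employee#>

    # Every employee name, with the supervisor's name where one exists.
    # The shared ?person and ?boss variables join the patterns (inner join);
    # OPTIONAL makes the supervisor lookup an outer join, so employees
    # without a supervisor still appear in the results.
    SELECT ?name ?bossName
    WHERE {
      ?person employee:name ?name .
      OPTIONAL {
        ?person employee:supervisor ?boss .
        ?boss employee:name ?bossName .
      }
    }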

The SPARQL 1.0 recommendation was approved in January 2008, less than half a year after my first encounter with the language, though it was only in 2010 that SPARQL began to show up in the then-current crop of RDF triple stores.

Standards-based development is always a hit-and-miss effort. At the time, the big stories were mainly focused on mobile, the first wave of AJAX applications was really just taking off, and JSON was beginning to contend for dominance against XML as the primary wire transfer format. Semantic triple stores were a comparative backwater, confined at the time primarily to academic and archival environments (there were a fair number of librarians who were trained in RDF usage, but even they were just beginning to come to grips with the new SPARQL language).

Additionally, as people began working with SPARQL, they discovered that performance was not as good as they were used to with relational databases (optimization still had a ways to go), and there were a number of things that people would have liked to do with SPARQL - such as aggregate queries - that were simply not possible. Finally, while SPARQL had a CONSTRUCT statement that could be used to create new triples from old ones, there was no direct analog to SQL's DDL and DML - a way of efficiently creating new data structures or updating the core set of triples in a data store.

The Semantic Web working groups went back to the drawing board and redesigned SPARQL to incorporate a number of new features, including the aforementioned aggregate query constructs, negative queries (the ability to detect when a pattern doesn't exist in the database) and better mechanisms for parameterization and working with graphs. Additionally, the SPARQL Update standard was established both to create new triples directly (through the INSERT DATA statement) and to insert, modify and delete triples via patterns. The SPARQL 1.1 and SPARQL 1.1 Update recommendations were published in March 2013.
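The new capabilities can be sketched briefly (again, the employee: and class: vocabularies and the name "Alice" are hypothetical, chosen only for illustration):

    PREFIX employee: <http://example.org/employee#>
    PREFIX class:    <http://example.org/class#>

    # SPARQL 1.1 aggregate: count direct reports per supervisor.
    SELECT ?boss (COUNT(?person) AS ?reports)
    WHERE { ?person employee:supervisor ?boss . }
    GROUP BY ?boss

    # SPARQL 1.1 negation: employees with no recorded supervisor.
    SELECT ?person
    WHERE {
      ?person a class:Employee .
      FILTER NOT EXISTS { ?person employee:supervisor ?anyone . }
    }

    # SPARQL 1.1 Update: create new triples directly.
    INSERT DATA {
      employee:Alice a class:Employee ;
                     employee:name "Alice" .
    }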

Historically, there has always been a lag of two to three years between the publication of a standard and its adoption by vendors or open source projects. It takes time to adapt existing products, to test them and get them into general use, and it is typically the second major release of a technology where adoption goes widespread. Last year saw the introduction of SPARQL 1.1 in a number of new data tools, and it is very likely that you'll see major vendors such as Oracle or Microsoft releasing SPARQL-compliant triple store products (either built on existing technology or, more likely, acquired) within the next couple of years as demand for these technologies grows (see ODATA4SPARQL as one potential direction).

What this means in practice is that even though RDF by itself is relatively mature, SPARQL as a query language is really just getting its legs. Performance was long an issue, but more triple stores are now roughly comparable in performance to other indexed stores, to the extent that certain types of computations (though by no means all) are in fact faster with triple stores than they are with relational databases, given similar CPUs and memory requirements. 

There's another factor that makes a big difference. When your data structures are highly regular and mostly atomic, SQL makes a great deal of sense, especially when you're talking about data that exists within a single database. An XML or JSON structure, by contrast, has a folded hierarchical shape along with limited notions of sequences or arrays. These two cases encompass a significant number of use cases, and it's no surprise that most object models tend to fall into a fixed hierarchy. So long as all of your data structures can be encompassed as compositions (typically container/contained relationships), an XML or JSON structure can act as a database in its own right.

Where things get hinky is the case where you have associations - or queries across associations, especially when your data structures are not so regular. For a long time, these were edge cases - you could build any number of applications where these relational links could be waved away via the use of ad-hoc conventions, simply because they were rare enough that consistency didn't really matter.

However, that's now changing. Increasingly, models are enterprise-wide, and contain a high degree of variability that SQL is just too cumbersome to handle well. Models are multidimensional - not only do they have the traditional entity-relationship (ER) associations, but relationships can be inherited along subclass or subproperty lines, data can have governance metadata associated with it (what's called reified data in semantic circles, where statements themselves have associated metadata such as author or production time), and class inheritance can establish properties that are difficult (if not impossible) to capture in ER diagrams.
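As a sketch of what reified data looks like in Turtle (the meta: vocabulary and the example values here are invented for illustration), a statement can itself become a resource that carries its own governance metadata:

    @prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix xs:       <http://www.w3.org/2001/XMLSchema#> .
    @prefix employee: <http://example.org/employee#> .
    @prefix meta:     <http://example.org/meta#> .

    # The assertion "Dilbert's supervisor is the Pointy-Haired Boss",
    # reified so the statement itself can carry provenance metadata.
    meta:assertion1 a rdf:Statement ;
        rdf:subject     employee:Dilbert ;
        rdf:predicate   employee:supervisor ;
        rdf:object      employee:PointyHairedBoss ;
        meta:assertedBy employee:HRSystem ;
        meta:assertedOn "2014-03-01"^^xs:date .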

This is where RDF shines, because metadata is fundamentally referential. RDF is built around the concept that any data structure can be broken down into nodes and node pointers, where a node is either a conceptual identifier or an atomic value. What's more, such a system in theory should be independent of specific implementation. For instance, I can represent an employee record in Turtle: 

employee:Dilbert
         a class:Employee ;
         employee:name "Dilbert"^^xs:string ;
         employee:gender gender:Male ;
         employee:supervisor employee:PointyHairedBoss .

The same information can be represented in JSON as:

[{"employee:Dilbert": {
         "type": {"resource": "class:Employee"},
         "name": {"value": "Dilbert", "datatype": "xs:string"},
         "gender": {"resource": "gender:Male"},
         "supervisor": {"resource": "employee:PointyHairedBoss"}
}}]

and in XML (sans namespaces) as:
 <class:Employee about="employee:Dilbert">
         <employee:name datatype="xs:string">Dilbert</employee:name>
         <employee:gender resource="gender:Male"/>
         <employee:supervisor resource="employee:PointyHairedBoss"/>
 </class:Employee>

As more and more data projects focus on the relationships (especially the associations) between objects within an enterprise model, RDF becomes more relevant. When you have several dozen (or several thousand) distinct classes of information, the ability to search comes down to the ability to identify type and type-inheritance information, to traverse relationships between multiple sets of objects and to identify contexts.
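SPARQL 1.1's property paths make exactly this kind of traversal concise. Using the employee record above (prefix IRIs hypothetical), one query walks an arbitrarily deep management chain, and another pulls in subclass inheritance:

    PREFIX employee: <http://example.org/employee#>
    PREFIX class:    <http://example.org/class#>
    PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>

    # Walk the entire management chain above Dilbert (one or more hops).
    SELECT ?manager
    WHERE { employee:Dilbert employee:supervisor+ ?manager . }

    # Find everything typed as class:Employee or any subclass of it.
    SELECT ?person
    WHERE { ?person a/rdfs:subClassOf* class:Employee . }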

These types of problems are becoming more common in the big data space, especially as you start working across sets of ontologies defined by different organizations or divisions. As a consequence, the need for SPARQL, which allows you to make such queries, will grow in significance as well.

Put another way, RDF (and by extension SPARQL) becomes more important as the data models themselves become more complex, more associational, and more heterogeneous, simply because the variety of information will dominate over factors such as volume or velocity.

Relational databases do not work well once you move outside the context of the database itself. Collections of XML or JSON can be queried for content, but do not lend themselves well to interdocument queries. Each of these technologies works fine within its particular scale or scope, but again does not handle heterogeneous ontologies very well, except in rare, specifically designed cases (such as XSLT or XForms).

This is one of the reasons I feel that the future for RDF (and graph databases in general) is quite bright. We are entering the era of shared data, and if we do not lower the impedance between models, the combinatorial explosion of transformations will prove an insurmountable problem. RDF and SPARQL are the most efficient set of tools we have to manage that.

Kurt Cagle is the founder of Semantical LLC, a data services company. 

Does anyone agree though that it's much easier and more powerful for a developer to query and transform RDF data in an XML DB with xquery/xslt/xpath?

L. M.

Technology Consultant. Solution Design & Development. IoT, AI, Digital Transformation

9 years ago

I used and very much liked RDF as part of the data model for a commercial CMS. It felt like a very natural way to describe some of the data we were handling. I look forward to using it again with better performance.

Ihe Onwuka

XML RDF and Ontological Technologist

9 years ago

Hmmm. Kurt have you read this. https://www.bloorresearch.com/analysis/the-language-of-graphs/ ? Let me add to those points. I watched a Neo4j presentation a while back in which it was stated that they use Cypher because and I quote "Developers don't get SPARQL". I'm sure you know what happens today when (people say) developers don't get something but fear not if you don't (of course I'm sure you do) because it is something I have been meaning to write about for quite some time.

Andre Cusson

Knowledge Architect

9 years ago

Thank you Kurt Cagle for this interesting post and for highlighting the growing requirements for better association management. While RDF supports association and may possibly do so better or even more elegantly than SQL, for example, in some cases, RDF's support for association still seems rather primitive, especially to adequately support the effective representation and management of knowledge, rather than just information. While knowledge is not yet commonly represented, stored, and managed, it seems that requirements are bound to grow in that direction. Accordingly, it seems that by the time SPARQL may light the RDF-world fire, it could soon be time for replacement. More so, while RDF may handle the current "open-data", possibly as well as locked-in closed data, since RDF has no entitlement support provision, sharing data and especially knowledge data, where "sharing" may not be limited to "giving", could seemingly further constrain the SPARQL fire. Still, best of luck to SPARQL and its laborious process.

