Graph Databases and Neo4j: A Stroll Down Memory Lane and Looking to the Future
A somewhat tongue-in-cheek graphic of AWS database options by Forrest Brazeal (@forrestbrazeal)

I came across this great tongue-in-cheek graphic from Forrest Brazeal (@forrestbrazeal) the other day on AWS database options:

[Image: Forrest Brazeal's graphic of AWS database options]

It's just a really fun yet still very informative graphic.

Two legs of this graph caught my eye:

  1. Managed Blockchain
  2. Neptune

My Graph Database Journey

At the end of 2013, I inherited the challenge of scaling a document distribution engine used in the life sciences space to disseminate documents to, and collect documents from, trial sites participating in clinical trials around the world.

A big part of the problem was the highly connected nature of the data model. In a relational world, it needed either a schema design highly optimized for complex and numerous (sometimes recursive) JOIN operations, or a higher degree of duplication to reduce the need for those JOIN operations.
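
To make the recursive JOIN problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `node` and `link` tables and their rows are hypothetical, not our actual schema; the point is only that following connections of arbitrary depth in a relational store means a recursive query, where every extra hop is another self-JOIN.

```python
import sqlite3

# Hypothetical, simplified schema: entities and the links between them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE link (parent_id INTEGER NOT NULL, child_id INTEGER NOT NULL);
    INSERT INTO node VALUES
        (1, 'study'), (2, 'country'), (3, 'site'), (4, 'document'), (5, 'other study');
    INSERT INTO link VALUES (1, 2), (2, 3), (3, 4);
""")

# Recursive common table expression: each level of depth re-joins link
# against the rows found so far. (Assumes the links contain no cycles.)
REACHABLE = """
    WITH RECURSIVE reachable(id, depth) AS (
        SELECT ?, 0
        UNION ALL
        SELECT link.child_id, reachable.depth + 1
        FROM link JOIN reachable ON link.parent_id = reachable.id
    )
    SELECT node.name, reachable.depth
    FROM reachable JOIN node ON node.id = reachable.id
    ORDER BY reachable.depth
"""

def reachable_from(start_id):
    """Return (name, depth) for every node reachable from start_id."""
    return conn.execute(REACHABLE, (start_id,)).fetchall()

for name, depth in reachable_from(1):
    print(depth, name)
```

Even in this toy form, the traversal logic lives in the query text rather than in the shape of the data, which is the part that gets harder to optimize as the model grows.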

When the problem landed in my lap, the immediate response was to go for quick wins drawn from my experience with SQL Server: faster storage, bigger "boxes", optimized file layouts, application-level caching, tuned SQL Server memory management, and so on. But without fundamentally rebuilding the schema, it would be impossible to achieve the target performance SLAs as the dataset grew.

I had seen a colleague working through Seven Databases in Seven Weeks and, after some discussion, started working my way through my own copy.

Chapter 7 on Neo4j kicked off my graph database journey; it seemed like a perfect fit for the data schema challenge in front of me.

Why Neo4j?

I had only come across graph databases in passing in my computer science courses a decade earlier. In mid-2013, as I started my research, however, it seemed like a renaissance for graph and graph-like databases; there were multiple fledgling options, including:

  1. TitanDB
  2. FlockDB from Twitter
  3. OrientDB
  4. Neo4j

And many others!

Given that, of the batch we evaluated, only Neo4j seems to have had long-term success, I think we did alright! In fact, Neo4j recently raised the largest funding round of any database venture in history!

(Photo: my battle-scarred gen1 vintage Surface Pro with Neo4j swag!)

Arturo Sevilla, one of my senior engineers at the time, spent almost 3 months just modelling our use case and performance testing Neo4j to see if it could work for us. Eventually, Arturo even ended up contributing the transaction management code for the open source Neo4jClient .NET drivers (as we needed them for our unit testing) and wrote this great article for Neo4j.

For me, there were three factors that tilted it in favor of Neo4j in those early days:

  1. Documentation. The Neo4j team really did a great job of putting together documentation and information was bountiful, even in those early days.
  2. Ease of Setup and Operation. Perhaps as a byproduct of the broad and deep documentation, it was easy to set up the database and understand how to operate it in a production scenario, even in those early days. It felt more like a finished product and you could see a path to a production rollout.
  3. Cypher. The clincher was the Cypher query language, the real gem of Neo4j.

Cypher vs Gremlin vs SPARQL

In the graph world, you will commonly encounter two other "mainstream" graph query languages:

  1. Gremlin: used in a variety of databases and a first-class query language in Amazon Neptune and Azure Cosmos DB.
  2. SPARQL: used in Amazon Neptune and a variety of other databases.

Gremlin never appealed to me because of how closely it maps to the academic concept of graphs; it feels a bit too raw. The query language leans heavily on graph terminology such as edges, vertices, and in and out paths. Here are some examples from a post by NebulaGraph:

[Image: query examples from the NebulaGraph post]

You be the judge on which appeals to you.

SPARQL, on the other hand, felt like it tried very hard to layer relational database semantics on top of graph paradigms. It was also quite verbose and not well aligned with the emergence of JSON as a ubiquitous data serialization format. There's an article on DZone that has some great examples if you are interested.

Cypher stood out for its visual congruence with the underlying logical graph structure, while its JSON-like syntax for property maps made it easy for folks to reason about. Even complex queries in Cypher feel "light" because of this characteristic. For the curious, the Neo4j Sandbox is the perfect place to get a feel for it.
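
As a rough illustration of that difference, here is one question (the titles of documents received by sites in Germany) written in each of the three languages. The labels, property names, and the `ex:` namespace are hypothetical and not taken from our production model. The queries are held as plain Python strings only so that the three can be printed side by side; nothing here connects to a database.

```python
import textwrap

# The same hypothetical question in three graph query languages:
# "Which document titles were received by sites in Germany?"
QUERIES = {
    "Cypher": """
        MATCH (s:Site {country: 'DE'})-[:RECEIVED]->(d:Document)
        RETURN d.title
    """,
    "Gremlin": """
        g.V().hasLabel('Site').has('country', 'DE').
          out('RECEIVED').hasLabel('Document').
          values('title')
    """,
    "SPARQL": """
        PREFIX ex: <http://example.org/>
        SELECT ?title WHERE {
          ?site a ex:Site ;
                ex:country "DE" ;
                ex:received ?doc .
          ?doc a ex:Document ;
               ex:title ?title .
        }
    """,
}

for language, query in QUERIES.items():
    print(f"--- {language} ---")
    print(textwrap.dedent(query).strip())
```

In the Cypher version, the `(s)-[:RECEIVED]->(d)` pattern is close to how you would sketch the relationship on a whiteboard, and the `{country: 'DE'}` property map reads much like a JSON object.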

While we were heads-down implementing our product on Neo4j, I came across rumblings of Neo4j opening up Cypher and dropping support for Gremlin. I am not familiar with the full details of the internal debates at Neo4j, but I do recall reading at the time that there was a lot of discussion about whether it was the right path, given the advantage that Cypher gave the Neo4j team versus the more open nature of Gremlin and thus its wider adoption.

The outcome of those discussions was openCypher, but I think the long-term impact of that transition will be far, far greater.

The Future of Graph Databases

As data workloads have increased and the demand for near-real-time analytics has grown, the world has moved on from relational databases to NoSQL and now to the new hotness of stream processing, and stream-based systems (not true "databases"?) such as Apache Spark have come to the forefront.

In 2019, the Apache Spark contributors voted to include a Cypher query implementation.

A video published by Databricks from Spark+AI in 2018 talks about the initiation of this integration.

And by 2019, it was slated to become part of Spark 3.0.

The Neo4j team is now also at the core of the ISO GQL effort to create a standard graph query language. Given the strong influence of Neo4j and Cypher in shaping this standard, I have very high hopes for this effort!

The GQL project has a four-year timespan. Seven national standards bodies (those of the United States, China, Korea, the Netherlands, the United Kingdom, Denmark and Sweden) have nominated national subject-matter experts to work on the project

(Via Wikipedia)

Amazon's 2017/2018 acquisition of Blazegraph and subsequent release of Neptune, Microsoft's support for Gremlin and graph semantics in Azure Cosmos DB, and the Apache Spark contributors' vote to adopt Cypher graph query support make me optimistic that graph databases will continue their arc toward maturity and mainstream adoption.

Of course, I'm super excited about GQL because Cypher is by far my favorite query language.

Are Graph Databases for You?

All that brings me back to Forrest Brazeal's excellent graphic and his "No you don't" u-turn on Neptune (Forrest, I would love to hear from you if you are reading this!).

The product my team built with Neo4j at the core went live in 2016, and I believe that at the time it was the only production Neo4j use case in clinical life sciences (might still be?). Today, it is still humming along as millions of documents are transacted each year and the customers onboard thousands of users across thousands of trial sites around the world. In retrospect, it is actually hard to imagine building it using a relational database!

I can only conjecture that part of the reason many folks ultimately end up dissatisfied with graph databases is the poor experience of working with Gremlin and SPARQL; as Marie Kondo might say:

[Image: Marie Kondo]

Graph itself is a novel concept for software and data engineers, and having to wrap one's mind around Gremlin or SPARQL as well can make it difficult to see the appeal. My hope for GQL and its influence from Cypher is that it can create a more approachable way to interact with graph databases.

Graph databases shine in cases where relational solutions would require:

  1. Complex or numerous JOIN operations that take a lot of time to optimize
  2. A high degree of data duplication across the schema to avoid complex JOIN operations (shifting some of the cost and complexity to managing the duplication)
  3. Complicated schema refactorings to optimize for the above

If you find yourself fighting this battle, it probably means that a graph database would be a very natural fit for your domain.

Neo4j and Cypher are not without their problems. For example, we have had to profile more than one query that ended up traversing every node in our database, due to the nature of graph databases and an oversight in bounding the graph traversals. (Perhaps the bright side is that even in production, this didn't crash the system; it just turned millisecond response times into 60-second response times.)
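
The effect of an unbounded traversal can be sketched in plain Python. This is not Neo4j code: the adjacency dictionary and the `max_depth` parameter are hypothetical stand-ins for a graph and for a depth bound on a traversal (in Cypher, a variable-length pattern such as `[*1..2]` plays that role).

```python
from collections import deque

def traverse(graph, start, max_depth=None):
    """Breadth-first traversal; returns the set of nodes visited.

    max_depth=None means unbounded, which can touch every connected node.
    """
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand past the bound
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return visited

# Hypothetical graph: a chain of 1,000 hops hanging off the start node.
graph = {i: [i + 1] for i in range(1000)}

print(len(traverse(graph, 0)))               # unbounded: visits all 1001 nodes
print(len(traverse(graph, 0, max_depth=2)))  # bounded: visits 3 nodes
```

The two calls differ only in the bound, yet one touches three nodes and the other touches the whole connected graph, which is roughly the kind of gap we saw between millisecond and 60-second responses.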

However, the classes of problems that map intuitively onto a graph paradigm are as numerous as graphs are in our everyday lives.

As some would say in Cypher: (graphs)-[:ARE]-(everywhere)

By Charles Chen
