Graph Databases and Neo4j: A Stroll Down Memory Lane and Looking to the Future

Charles Chen

Principal Software Engineer | Maker: Turas.app + CodeRev.app

Published: July 14, 2021

I came across this great tongue-in-cheek graphic from Forrest Brazeal (@forrestbrazeal) the other day on AWS database options:

It's just a really fun yet still very informative graphic.

Two legs of this graph caught my eye:

Managed Blockchain
Neptune

My Graph Database Journey

At the end of 2013, I inherited the challenge of scaling a document distribution engine used in the life sciences space to distribute documents to, and collect documents from, trial sites participating in clinical trials all around the world.

A big part of the problem was the highly connected nature of the data model. In a relational world, it needed either a schema highly optimized for numerous and complex (sometimes recursive) JOIN operations, or a higher degree of data duplication to reduce the need for those JOIN operations.
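To make the relational side of this trade-off concrete, here is a minimal sketch in Python using the standard library's sqlite3 module. The `node` and `edge` tables, their labels, and the `descendants` helper are all hypothetical and are not our actual schema; the point is only to show the kind of recursive JOIN that a connected model pushes you toward.

```python
import sqlite3

# Hypothetical schema: a generic edge table linking parents to children
# (for example study -> country -> site -> document). Illustrative names only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE edge (parent_id INTEGER NOT NULL, child_id INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO node VALUES (?, ?)", [
    (1, "study"), (2, "country"), (3, "site-a"), (4, "site-b"), (5, "document"),
])
conn.executemany("INSERT INTO edge VALUES (?, ?)", [
    (1, 2), (2, 3), (2, 4), (3, 5),
])


def descendants(conn, root_id):
    """Return labels of every node reachable from root_id, ordered by id."""
    rows = conn.execute("""
        WITH RECURSIVE reachable(id) AS (
            -- Anchor: direct children of the root.
            SELECT child_id FROM edge WHERE parent_id = ?
            UNION
            -- Recursive step: join the edge table back onto what we found.
            SELECT e.child_id FROM edge e JOIN reachable r ON e.parent_id = r.id
        )
        SELECT n.label FROM reachable r JOIN node n ON n.id = r.id ORDER BY n.id
    """, (root_id,)).fetchall()
    return [label for (label,) in rows]


print(descendants(conn, 1))
```

Every extra hop is another self-join on `edge`, which is where the tuning effort (or the temptation to denormalize) comes from as the dataset grows.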

When the problem landed in my lap, the immediate solutions were to try to get quick wins from my experience working with SQL Server: faster storage, bigger "boxes", optimized file layouts, application-level caching, tuned SQL Server memory management, etc. But without fundamentally rebuilding the schema, it would be impossible to achieve the target performance SLAs as the dataset grew.

I had seen a colleague working through Seven Databases in Seven Weeks and after some discussion, started working my way through my own copy.

Chapter 7 on Neo4j kicked off my graph database journey; it seemed like a perfect fit for the data schema challenge in front of me.

Why Neo4j?

I had only come across graph databases in passing in my computer science courses a decade earlier. When I started my research in mid-2013, however, graph and graph-like databases seemed to be enjoying a renaissance; there were multiple fledgling options, including:

TitanDB
FlockDB from Twitter
OrientDB
Neo4j

And many others!

Given that of the batch we evaluated, only Neo4j seems to have had long-term success, I think we did alright! In fact, Neo4j recently raised the largest funding round of any database venture in history!

(My battle scarred gen1 vintage Surface Pro with Neo4j swag!)

Arturo Sevilla, one of my senior engineers at the time, spent almost 3 months just modelling our use case and performance testing Neo4j to see if it could work for us. Eventually, Arturo even ended up contributing the transaction management code for the open source Neo4jClient .NET drivers (as we needed them for our unit testing) and wrote this great article for Neo4j.

For me, there were three factors that tilted it in favor of Neo4j in those early days:

Documentation. The Neo4j team really did a great job of putting together documentation and information was bountiful, even in those early days.
Ease of Setup and Operation. Perhaps as a byproduct of the broad and deep documentation, it made it easy to set up and understand how to operate the database in a production scenario, even in those early days. It felt more like a finished product and you could see a path to a production rollout.
Cypher. The absolute clincher has to be the Cypher query language (CQL), the real gem of Neo4j.

Cypher vs Gremlin vs SPARQL

Beyond Cypher, you will commonly encounter two other "mainstream" graph query languages in the graph world:

Gremlin: used in a variety of databases and a first-class query language in Amazon Neptune and Azure Cosmos DB.
SPARQL: used in Amazon Neptune and a variety of other databases.

Gremlin never appealed to me because of how closely it maps to the academic concepts of graph theory; it feels a bit too raw. The query language leans heavily on graph-theory terminology such as edges, vertices, and in and out paths. Here are some examples from a post by NebulaGraph:

You be the judge on which appeals to you.

SPARQL, on the other hand, felt like it tried very hard to layer relational database semantics on top of graph paradigms. It was also quite verbose and not well aligned with the emergence of JSON as a ubiquitous data serialization format. There's an article on DZone that has some great examples if you are interested.

Cypher stood out for its visual congruence with the underlying logical graph structure, while its JSON-like syntax for property maps made it easy for folks to reason about. Even complex queries in Cypher feel "light" and easy to follow because of this characteristic. For the curious, the Neo4j Sandbox is the perfect place to get a feel for it.

While we were heads down implementing our product on Neo4j, I came across rumblings of Neo4j opening up Cypher and dropping support for Gremlin. I am not familiar with the full details of the internal debates at Neo4j; I do recall reading at the time that there was a lot of discussion on whether it was the right path, given the advantage that Cypher gave the Neo4j team versus the more open nature of Gremlin and thus its wider adoption.

The outcome of those discussions was openCypher, but I think the long term impact of that transition will be far, far greater.

The Future of Graph Databases

As data workloads have increased and the demand for near-real-time analytics has grown, the world has moved on from relational databases to NoSQL and now to the new hotness of stream processing. Stream-based systems (not true "databases"?) such as Apache Spark have come to the forefront.

In 2019, the Apache Spark contributors voted to include a Cypher query implementation.

A video published by Databricks from the Spark + AI Summit in 2018 talks about the initiation of this integration.

And by 2019, it was slated for inclusion in Spark 3.0:

The Neo4j team is now also at the core of the ISO GQL effort to create a standard graph query language. Given the strong influence of Neo4j and Cypher in shaping this standard, I have very high hopes for this effort!

The GQL project has a four-year timespan. Seven national standards bodies (those of the United States, China, Korea, the Netherlands, the United Kingdom, Denmark and Sweden) have nominated national subject-matter experts to work on the project

(Via Wikipedia)

Amazon's 2017/2018 acquisition of Blazegraph and subsequent release of Neptune, Microsoft's support for Gremlin and graph semantics in Azure Cosmos DB, and Apache Spark's adoption of Cypher graph query support make me optimistic that graph databases will continue their arc of maturity and mainstream adoption.

Of course, I'm super excited about GQL because Cypher is by far my favorite query language.

Are Graph Databases for You?

All that brings me back to Forrest Brazeal's excellent graphic and his "No you don't" u-turn on Neptune (Forrest, I would love to hear from you if you are reading this!).

The product my team built with Neo4j at the core went live in 2016 and I believe at the time, it was the only production Neo4j use case in clinical life sciences (might still be?). Today, it is still humming along as millions of documents are transacted each year and the customers onboard thousands of users across thousands of trial sites around the world. In retrospect, it is actually hard to imagine building it using a relational database!

I can only conjecture that part of the reason many folks ultimately end up dissatisfied with graph databases is the poor experience of working with both Gremlin and SPARQL; as Marie Kondo might say:

Graphs are already a novel concept for many software and data engineers, and having to wrap one's mind around Gremlin or SPARQL as well can make it difficult to see the appeal. My hope is that GQL, with its influence from Cypher, can create a more approachable way to interact with graph databases.

Graph databases shine in cases where relational solutions would require:

A high number of complex JOIN operations, where one would spend a lot of time optimizing those operations
High degree of data duplication across the schema to avoid complex JOIN operations (and shifting some of the cost and complexity to managing the duplication)
Complicated schema refactorings to optimize for the above

If you find yourself fighting this battle, it probably means that a graph database would be a very natural fit for your domain.

Neo4j and Cypher are not without their problems. For example, we have had to profile more than one query that ended up traversing every node in our database, due to the nature of graph databases and an oversight in bounding the graph traversals. (Perhaps the bright side is that even in production, this didn't crash the system; it just turned millisecond response times into 60-second response times.)
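As a rough illustration of why bounding matters, here is a small Python sketch of a breadth-first traversal over an in-memory adjacency list. The `neighbors_within` function and the `chain` data are made up for this example and are not Neo4j code; in Cypher, the comparable safeguard is putting an upper bound on a variable-length pattern, e.g. `[:REL*1..3]`.

```python
from collections import deque


def neighbors_within(graph, start, max_depth=None):
    """Breadth-first traversal returning the set of nodes reachable from start.

    graph maps each node to a list of adjacent nodes.
    max_depth=None means unbounded; otherwise nodes more than max_depth
    hops away from start are never visited.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        # Stop expanding once the hop limit is reached.
        if max_depth is not None and depth >= max_depth:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    seen.discard(start)
    return seen


# Hypothetical data: a chain of nodes 0 -> 1 -> ... -> 1000.
chain = {i: [i + 1] for i in range(1000)}

unbounded = neighbors_within(chain, 0)
bounded = neighbors_within(chain, 0, max_depth=2)
print(len(unbounded), len(bounded))
```

On this toy chain, the unbounded call walks all 1,000 downstream nodes, while `max_depth=2` stops after two hops. The same query shape, with and without a bound, is the difference between touching a neighborhood and touching the whole graph.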

However, the classes of problems that map intuitively onto a graph paradigm are as numerous as graphs are in our everyday lives.

As some would say in Cypher: (graphs)-[:ARE]-(everywhere)
