You don't need a graph database: Modeling graphs and trees in PostgreSQL

You don't need a graph database: Modeling graphs and trees in PostgreSQL

In this article, we explore how Postgres, a powerful and versatile relational database, can be effectively used to model and traverse graphs and trees. While specialized graph databases exist, such as Neo4j, Postgres offers a robust alternative without the need for additional database systems.

We show how to implement common graph operations:

  • Getting all (transitive) neighbor nodes.
  • Detecting (and preventing) cycles.
  • Finding the shortest path (and all paths) between two nodes.

We will use simple normalized tables to model the graph. Recursive queries will play a key role for traversing the graph. Finally, we'll use advanced Postgres control structures for performance optimizations.

Definition of a graph

A graph has a set of nodes and a set of edges. An edge is a link between two nodes, establishing a relation between them.

Use-case examples

  1. Social networks. In such a graph, the nodes are persons and the edges represent the "is friend" relation.
  2. Documents in a hierarchical knowledge base. Such a graph consists of documents as nodes and the edges represent the "is subdocument" relation. In this way, permissions to documents can propagate to all its sub-documents, sub-sub-documents, etc.
  3. Traversing a knowledge graph powering a search engine, knowledge assistant or e-commerce product recommendations. The edges may represent "is related to", or something more specific like "is made of material".

More generally, anytime you have a many-to-many relation from a table to itself (via another table; the edge table) you have a graph.

Setup

Start a local Postgres instance:

docker run --rm -p 5432:5432 -e POSTGRES_PASSWORD=postgres postgres:15-alpine        

Access the DB:

psql postgresql://postgres:[email protected]:5432/postgres        

Create tables to model the graph:

CREATE TABLE node (
  id BIGINT PRIMARY KEY
);

CREATE TABLE edge (
  source_node_id BIGINT REFERENCES node(id),
  target_node_id BIGINT REFERENCES node(id),

  PRIMARY KEY (source_node_id, target_node_id),
  CHECK (source_node_id != target_node_id)
);        

We keep the graph simple. This models no particular use-case but demonstrates the fundamental principles. This is a directed graph since edges have a direction (source -> target).

You can imagine many additional properties of the graph. For example: Nodes and edges can be given a 'type' column. This may represent product categories for the nodes and different types of relations such as, "is instance of" or "is made of material", for the edges. Another example: The graph can be turned into a weighted graph by adding a 'weight' column to the edge table. This may represent the cost of utilizing an edge.

Insert example data:

INSERT INTO node (id) VALUES (1), (2), (3), (4);

INSERT INTO edge (source_node_id, target_node_id) VALUES
(1, 2),
(2, 3),
(1, 4);        


Traversing the graph

Before showing the general solution with recursive queries, we show the naive approach using joins. You can skip this section but it may be useful for your intuition.

Naive approach

Get all direct neighbor nodes to a given node:

SELECT target_node_id AS node_id
FROM edge
WHERE source_node_id = 1; -- Insert starting node here        
 node_id
---------
       2
       4
(2 rows)        

Get all 2nd degree neighbor nodes to a given node:

SELECT e2.target_node_id AS node_id
FROM edge AS e1
JOIN edge AS e2 ON e1.target_node_id = e2.source_node_id
WHERE e1.source_node_id = 1; -- Insert starting node here        
 node_id
---------
       3
(1 row)        

Get all 1st, 2nd, and 3rd degree neighbor nodes to a given node:

We can continue adding joins to go further. We can also combine the results to get all the nodes at distance 1, 2 and 3:

-- 1st degree neighbor nodes
SELECT target_node_id AS node_id
FROM edge
WHERE source_node_id = 1 -- Insert starting node here

UNION

-- 2nd degree neighbor nodes
SELECT e2.target_node_id AS node_id
FROM edge AS e1
JOIN edge AS e2 ON e1.target_node_id = e2.source_node_id
WHERE e1.source_node_id = 1 -- Insert starting node here

UNION

-- 3rd degree neighbor nodes
SELECT e3.target_node_id AS node_id
FROM edge AS e1
JOIN edge AS e2 ON e1.target_node_id = e2.source_node_id
JOIN edge AS e3 ON e2.target_node_id = e3.source_node_id
WHERE e1.source_node_id = 1; -- Insert starting node here        
 node_id
---------
       4
       2
       3
(3 rows)        

Now, let's write a query for the general case:

Using recursive queries to traverse the graph

WITH RECURSIVE traversed AS (
  SELECT
    target_node_id AS node_id,
    1 AS depth,
    ARRAY[source_node_id] AS path
  FROM edge
  WHERE source_node_id = 1 -- Insert starting node here

  UNION ALL

  SELECT
    edge.target_node_id,
    traversed.depth + 1,
    traversed.path || edge.source_node_id
  FROM traversed
  JOIN edge ON edge.source_node_id = traversed.node_id
)
SELECT node_id, depth, path || node_id AS path
FROM traversed;        
 node_id | depth |  path
---------+-------+---------
       2 |     1 | {1,2}
       4 |     1 | {1,4}
       3 |     2 | {1,2,3}        

If you want nodes at a specific depth, just add WHERE depth = {selectedDepth}

Recursive queries are very powerful. To understand them, let's look at how they are evaluated.

Understanding recursive queries

The term before UNION [ALL] is the non-recursive term, and the term after is the recursive term. Recursive queries are evaluated iteratively:

  1. Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.
  2. So long as the working table is not empty, repeat these steps:

  • Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.
  • Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.

See the Postgres docs for more details.

Preventing cycles

Let's insert one more edge to create a cycle (1 -> 2 -> 3 -> 1):

INSERT INTO edge (source_node_id, target_node_id) VALUES (3, 1);        

With this change, the previous recursive query will get stuck in an infinite loop. We can see the initial result by adding LIMIT 10:

 node_id | depth |       path
---------+-------+-------------------
       2 |     1 | {1,2}
       4 |     1 | {1,4}
       3 |     2 | {1,2,3}
       1 |     3 | {1,2,3,1}
       2 |     4 | {1,2,3,1,2}
       4 |     4 | {1,2,3,1,4}
       3 |     5 | {1,2,3,1,2,3}
       1 |     6 | {1,2,3,1,2,3,1}
       2 |     7 | {1,2,3,1,2,3,1,2}
       4 |     7 | {1,2,3,1,2,3,1,4}
(10 rows)        

We can stop before reaching a cycle by adding WHERE NOT edge.target_node_id = ANY(traversed.path):

WITH RECURSIVE traversed AS (
  SELECT
    target_node_id AS node_id,
    1 AS depth,
    ARRAY[source_node_id] AS path
  FROM edge
  WHERE source_node_id = 1 -- Insert starting node here

  UNION ALL

  SELECT
    edge.target_node_id,
    traversed.depth + 1,
    traversed.path || edge.source_node_id
  FROM traversed
  JOIN edge ON edge.source_node_id = traversed.node_id
  WHERE
    NOT edge.target_node_id = ANY(traversed.path) -- Stop on cycle
    AND traversed.depth < 100 -- Sanity check
)
SELECT node_id, depth, path || node_id AS path
FROM traversed;        
 node_id | depth |  path
---------+-------+---------
       2 |     1 | {1,2}
       4 |     1 | {1,4}
       3 |     2 | {1,2,3}
(3 rows)        

With a slight variation we can see the cycles explicitly:

WITH RECURSIVE traversed AS (
  SELECT
    target_node_id AS node_id,
    1 AS depth,
    ARRAY[source_node_id] AS path,
    FALSE AS is_cycle
  FROM edge
  WHERE source_node_id = 1 -- Insert starting node here

  UNION ALL

  SELECT
    edge.target_node_id,
    traversed.depth + 1,
    traversed.path || edge.source_node_id,
    edge.target_node_id = ANY(traversed.path)
  FROM traversed
  JOIN edge ON edge.source_node_id = traversed.node_id
  WHERE
    NOT traversed.is_cycle -- Stop on cycle
    AND traversed.depth < 100 -- Sanity check
)
SELECT node_id, depth, path || node_id AS path, is_cycle
FROM traversed;        
 node_id | depth |   path    | is_cycle
---------+-------+-----------+----------
       2 |     1 | {1,2}     | f
       4 |     1 | {1,4}     | f
       3 |     2 | {1,2,3}   | f
       1 |     3 | {1,2,3,1} | t
(4 rows)        

Enforcing no cycles in the data

It may suffice to ignore cycles when selecting, like above. But we may want to prevent cycles from ever being created, so that the data is really an acyclic graph (a tree).

This can be done with a CHECK constraint on the edge table. To do this, we will create a function that checks for cycles. The function 'has_cycle' works by checking "if an edge with the given 'source_node_id' and 'target_node_id' was to be inserted, would there be a cycle?".

First, a non-optimized implementation that keeps computing after the first cycle has been detected:

CREATE OR REPLACE FUNCTION has_cycle(input_source_node_id BIGINT, input_target_node_id BIGINT)
RETURNS BOOLEAN
LANGUAGE plpgsql AS $$
DECLARE
  result BOOLEAN;
BEGIN
  WITH RECURSIVE traversed AS (
    SELECT
      ARRAY[input_source_node_id] AS path,
      input_target_node_id AS target_node_id,
      false as is_cycle

    UNION ALL

    SELECT
      traversed.path || edge.source_node_id,
      edge.target_node_id,
      edge.target_node_id = ANY(traversed.path)
    FROM traversed
    JOIN edge ON edge.source_node_id = traversed.target_node_id
    WHERE NOT traversed.is_cycle
  )
  SELECT EXISTS (SELECT 1 FROM traversed WHERE target_node_id = ANY(path)) INTO result;

  RETURN result;
END;
$$;        

As an optimization, we can stop computing when the first cycle has been detected. We can use the 'FOR rec IN query LOOP statement END LOOP' construct. This construct allows us to tap into the row selection process. Each time a row is generated in 'query', it is assigned to 'rec' and 'statement' is executed. From 'statement' we can make an early exit and stop generating rows in 'query'. (See the Postgres docs for more details.)

CREATE OR REPLACE FUNCTION has_cycle(input_source_node_id BIGINT, input_target_node_id BIGINT)
RETURNS BOOLEAN
LANGUAGE plpgsql AS $$
DECLARE
  rec RECORD;
BEGIN
  FOR rec IN
    WITH RECURSIVE traversed AS (
      SELECT
        ARRAY[input_source_node_id] AS path,
        input_target_node_id AS target_node_id

      UNION ALL

      SELECT
        traversed.path || edge.source_node_id,
        edge.target_node_id
      FROM traversed
      JOIN edge ON edge.source_node_id = traversed.target_node_id
    )
    SELECT * FROM traversed
  LOOP
    IF rec.target_node_id = ANY(rec.path) THEN
      RETURN TRUE; -- Early return, stop looking when first cycle is detected
    END IF;
  END LOOP;

  RETURN FALSE;
END;
$$;        

We can now add add a CHECK to enforce that no cycles ever get inserted. First, we delete the row that violates the check:

DELETE FROM edge WHERE source_node_id = 3 AND target_node_id = 1;

ALTER TABLE edge ADD CONSTRAINT check_no_cycles CHECK (NOT has_cycle(source_node_id, target_node_id));        

The CHECK runs before a row is inserted, both 'source_node_id' and 'target_node_id' refer to columns of the row that is to be inserted.

If we now try to insert an edge that results in a cycle, we will get an error:

INSERT INTO edge (source_node_id, target_node_id) VALUES (3, 1);
-- ERROR:  new row for relation "edge" violates check constraint "check_no_cycles"        

Finding the shortest path

Using the principles of the 'has_cycle' function, we can write a function that finds the shortest path between two nodes, using the same early-stopping optimization:

CREATE OR REPLACE FUNCTION shortest_path(input_source_node_id BIGINT, input_target_node_id BIGINT)
RETURNS BIGINT[]
LANGUAGE plpgsql AS $$
DECLARE
  rec RECORD;
BEGIN
  FOR rec IN
    WITH RECURSIVE traversed AS (
      SELECT
      ARRAY[edge.source_node_id] AS path,
      edge.target_node_id AS node_id
    FROM edge
    WHERE edge.source_node_id = input_source_node_id

      UNION ALL

      SELECT
        traversed.path || edge.source_node_id,
        edge.target_node_id
      FROM traversed
      JOIN edge ON edge.source_node_id = traversed.node_id
    )
    SELECT path || node_id as path FROM traversed
  LOOP
    IF input_target_node_id = ANY(rec.path) THEN
      RETURN rec.path; -- Early return, stop looking when first path is detected
    END IF;
  END LOOP;

  RETURN NULL;
END;
$$;        

Get all paths between two nodes:

CREATE OR REPLACE FUNCTION all_paths(input_source_node_id BIGINT, input_target_node_id BIGINT)
RETURNS SETOF BIGINT[]
LANGUAGE plpgsql AS $$
DECLARE
  rec RECORD;
BEGIN
  RETURN QUERY WITH RECURSIVE traversed AS (
    SELECT
      ARRAY[edge.source_node_id] AS path,
      edge.target_node_id AS node_id
    FROM edge
    WHERE edge.source_node_id = input_source_node_id

    UNION ALL

    SELECT
      traversed.path || edge.source_node_id,
      edge.target_node_id
    FROM traversed
    JOIN edge ON edge.source_node_id = traversed.node_id
    WHERE NOT edge.target_node_id = ANY(traversed.path)
  )
  SELECT path || node_id AS path
  FROM traversed
  WHERE node_id = input_target_node_id;
END;
$$;        

Let's test the two functions. First, we prepare some test data:

DELETE FROM edge;

DELETE FROM node;

INSERT INTO node (id) VALUES (1), (2), (3), (4), (5);

INSERT INTO edge (source_node_id, target_node_id) VALUES
(1, 2),
(2, 3),
(3, 4),
(4, 5),
(1, 3),
(2, 5);        


Now test the functions:

SELECT all_paths(1, 5);        
  all_paths
-------------
 {1,2,5}
 {1,3,4,5}
 {1,2,3,4,5}
(3 rows)        
SELECT shortest_path(1, 5);        
 shortest_path
---------------
 {1,2,5}
(1 row)        

Conclusion

As we've seen, plain Postgres works well for many graph queries.

By not introducing another database to one's stack one gets the benefits of maintaining only one database and avoiding syncing data and permissions across different databases. One will likely need Postgres even if one opts for a graph database to maintain basic application data.

We wrap up by giving you a hint of what else there is.

Beyond Postgres: The Cypher query language and GQL

The Cypher Query Language (see also GQL: Graph Query Language) can express the 'has_cycle' and 'shortest_path' queries much more succinctly:

-- Find cycle
MATCH path = (n)-[*]->(n)
WHERE length(path) > 1
RETURN path LIMIT 1

-- Find path
MATCH (start:Node {id: 'StartNodeID'}), (end:Node {id: 'EndNodeID'})
MATCH p = shortestPath((start)-[*]-(end))
RETURN p        

There are extensions for Postgres such as AGE (supports Cypher) and pgRouting that add powerful graph operations to Postgres. However, these are not that common and might not be available in your managed SQL database provider.

While dedicated graph databases are powerful, plain Postgres might be all you need.

In this article, we used the methods:

From someone that scaled a postgres graph database with 7+ billion edges under a low budget, I'd also add not starting with BIGINT for the `node` id. If there is not a need to make the id of `node` a BIGINT - don't. Especially for social networks or any similar use cases where you'll have a lot of different connections between nodes - edges outweigh the nodes in a drastic ratio hence with BIGINT you are only slowing down the queries made. There are also other changes you can make to the setup to make it even more cost efficient, but that depends on the use case and business?logic..

Vikash Pandey

Senior Software Developer at J&F

9 个月

Viktor Qvarfordt Concern About Recursion Slowing Down Query?

回复
David 'DC' Collier

Gen AI Engineer / Product Designer

11 个月

thanks for this and the details around preventing cycles! Have you considered a design for a lightweight ORM type wrapper for these type of queries, to hide a bit of this hairy SQL? something like drizzle-orm, but for more graph like syntax. I don't even like writing cypher because it's mostly just ascii with no typing/intellisense. For 80% of basic queries - pathfinding, Knn etc a nice little DSL could be very useful on top of plain postgres.

Kevin Warner

Principal Software Architect, Engineer, and Consultant at Self Employed

1 年

Thanks for this. Very helpful. However, how would I change the recursive query to be "inclusive" of the starting node? For example, I want to see the entire tree starting at node 1, but include node 1 in the results?

Petter Strandmark

R&D Manager at Apple | ex-Google

1 年

You can indeed do worse than "Postgresql until proven impossible"

要查看或添加评论,请登录

Viktor Qvarfordt的更多文章

  • LLM Agents: Overview and implementation

    LLM Agents: Overview and implementation

    What do we exactly mean by an agent? And how do we implement them? This is what we will explore in this article. LLM…

    7 条评论
  • Postgres data isolation and Row Level Security

    Postgres data isolation and Row Level Security

    What? When building SaaS systems, customer data isolation is essential. This can be achieved in different ways, with…

    3 条评论
  • Cross-encoder for vector search re-ranking

    Cross-encoder for vector search re-ranking

    In RAG applications, cross-encoders are used for re-ranking vector search. How? First, one uses vector search to get…

    6 条评论

社区洞察

其他会员也浏览了