A not-so-good idea: Pipe Syntax In SQL
Colors from The Treachery of Images (René Magritte)

Many SQL users have expressed frustration with SQL's SELECT query syntax. They argue that starting with the FROM clause and ending with the SELECT clause would be more intuitive: developers could specify the tables they want to query before using their column names, which matches how they think about a query. This order also works better with auto-completion, since an editor cannot suggest column names before it knows which tables are being queried.

Some databases, like DuckDB, have implemented this, and users seem to like it. In the following paper, Google describes a new 'pipe syntax' for its services that expose a SQL-like API: SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL (research.google)

To get the idea, here is an example from their paper:

(Figure: query 13 from the TPC-H benchmark, rewritten with SQL pipes)
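
The figure isn't reproduced here, but the pipe-syntax version of Q13 looks approximately like this. This is a sketch reconstructed from the paper's example, so details may differ from the original figure:

FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
       AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) AS c_count
   GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
   GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;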

The pipe syntax has been implemented in GoogleSQL, which parses SQL statements in F1, BigQuery, Spanner, Procella, and ZetaSQL. These are not full-featured SQL databases. Would it make sense to use the same syntax for relational SQL databases?

I could embrace this approach for two reasons:

  • I like maximizing user experience, and users seem to love starting SELECT statements with 'FROM' rather than 'SELECT'.
  • I like using pipes in a Linux shell, or with jq to parse and transform JSON, chaining the execution steps for better readability and troubleshooting (see the one-liner right after this list).
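
For example, a typical jq one-liner (with inline data, unrelated to the paper) chains small transformations left to right, and each step can be inspected by truncating the pipeline:

echo '{"a": 1, "b": 2}' | jq 'to_entries | map(.value) | add'

3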

However, this pipe syntax is terrible for writing clean SQL code. And that's not only because I need to type four keys on my Swiss keyboard to get a "|>". Let me explain why in a few points.

You can start writing FROM and continue with SELECT above it.

I understand the convenience of starting to write an SQL query with its FROM clause, especially with auto-completion. However, the final SQL code doesn't need to follow the order in which it was typed. When writing code, I move around in all directions. For instance, with auto-completion, I still type characters like "{" or "'", which are expanded to "{ }" or "' '", and then I go back to fill them in.

When it comes to procedural logic, I often write it before encapsulating it into a function with its name and arguments. That is not a reason to reorganize the language and put whatever I typed first at the front. We don't write code, or even text, in a single direction as we did in the days of mechanical typewriters, when going back was painful. I have arrow keys, a trackpad, a touch screen, and a mouse.

The bottom-up approach, starting from the tables and transforming them into a result, may be convenient for ad-hoc queries. Clean code, however, may start with the application objects that define the result you expect from the query. Rather than listing the tables you want to query (FROM) and then transforming them into what you provide to the application (SELECT), you can go top-down and start with what the application needs. That's the SELECT clause, and the rest of the SQL query describes how to get it from the data structures.

You read SQL queries more often than you write them.

There's a good reason why, in all languages, a function starts with its signature: name, input arguments, and return type. The first thing you want to know about a function is how it interfaces with your code. You only look at the function body when you want to understand how it works.
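
As a minimal PostgreSQL sketch (a hypothetical function over the TPC-H tables used later in this article), the first line tells you everything about the interface; you read the body only to understand the how:

CREATE FUNCTION order_count(p_custkey bigint) RETURNS bigint
LANGUAGE sql AS $$
   -- the body: how the interface above is implemented
   SELECT COUNT(*) FROM orders WHERE o_custkey = p_custkey;
$$;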

The same is true for an SQL query. The SELECT clause defines what it returns to your program: the columns of the tabular result, with their names. Those names are meaningful for the result and may differ from the names of the table columns they derive from.

Look at the pipe-syntax query above. How would you guess the structure of the result from it? Where would you add comments to describe it?

Here is how I would write it in SQL (keeping their awful column names and uppercase keywords):

WITH
"c_orders" (c_custkey, c_count) AS (
   -- For each customer (c_custkey),
   -- calculate the number of orders (c_count)
   SELECT
         c.c_custkey
       , COUNT(o.o_orderkey)
   FROM customer AS c
   LEFT OUTER JOIN orders AS o
     ON c.c_custkey = o.o_custkey
    AND o.o_comment NOT LIKE '%unusual%packages%'
   GROUP BY c.c_custkey
)
-- For each number of orders (c_count),
-- count the customers who have that many (custdist)
SELECT
      c_count
    , COUNT(*) AS custdist
FROM "c_orders"
GROUP BY c_count
ORDER BY
      custdist DESC
    , c_count DESC

I defined two logical steps. First, I aggregated the customers' orders to return a single count per customer. Then, I aggregated those counts to show how they were distributed.

The first step is the input of the second one, and you can write it like I did with Common Table Expressions, or with nested subqueries. I like CTEs because they name each intermediate result and simplify debugging: you can query an intermediate result on its own. I'll give a better example later.
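
As a quick sketch of that debugging workflow, on the same TPC-H tables, you can run the named intermediate result alone before building the final aggregation on top of it:

WITH "c_orders" (c_custkey, c_count) AS (
   SELECT c.c_custkey, COUNT(o.o_orderkey)
   FROM customer AS c
   LEFT OUTER JOIN orders AS o ON c.c_custkey = o.o_custkey
   GROUP BY c.c_custkey
)
SELECT * FROM "c_orders" LIMIT 10;  -- inspect the first step only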

SQL operates on sets. Pipes operate on streams.

Let's look at a typical example of pipes. In Linux, the uniq command deduplicates its input, but only adjacent duplicates, because it processes a stream.

{
   echo 'Hello'
   echo 'World'
   echo 'Hello'
   echo 'Reader'
} | uniq

Hello
World
Hello
Reader

If I had written the input in a different order, the result would have been different:

{
   echo 'Hello'
   echo 'Hello'
   echo 'World'
   echo 'Reader'
} | uniq

Hello
World
Reader

A pipe is a FIFO (first in, first out) queue: it reads its input in a specific order and returns its output as an ordered stream. The whole idea of pipelining is that each step processes the input as it reads it, without ever seeing the entire set.

SQL works on row sets that have no order. The processing order is an implementation detail that is not visible to the user and is determined later, during query planning. SQL doesn't define a data flow (like KSQL on Kafka does, for example) but declares operations on data sets.

In SQL, the DISTINCT keyword deduplicates the input:

WITH "input"(line) AS ( VALUES 
            ('Hello')
           ,('World')
           ,('Hello')
           ,('Reader')
) SELECT DISTINCT line FROM "input";

  line
--------
 World
 Hello
 Reader
(3 rows)        

You can rearrange the input values, and the result will be the same. SQL operates on sets, like mathematical sets, which have no ordering. The intermediate result is not pipelined between two subqueries and doesn't depend on the processing order. The syntax must not give the false impression that there's a data stream between subqueries.

There's only one clause that resembles a pipe: the ORDER BY clause. It defines how the result set must be ordered for the application to fetch it, because the network protocol is FIFO and the target structure in the application may not be a set but an ordered list. The only pipeline in SQL is the final result set between the database and the application.

Representing the dependency between subqueries as pipes may feel familiar in simple queries, but it leads to an incorrect understanding of how SQL is processed. Many SQL users already forget an ORDER BY because they think the results come back in a deterministic order once they define an index to accelerate a range query. It will be worse if they believe that rows go through FIFO pipes, which are supposed to preserve order.
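
Here is a sketch of that trap (the date values are hypothetical). The first query may happen to come back sorted when an index on o_orderdate serves the range scan, but only the second one guarantees it:

SELECT o_orderkey FROM orders
WHERE o_orderdate >= DATE '2024-01-01';  -- any order is an accident of the plan

SELECT o_orderkey FROM orders
WHERE o_orderdate >= DATE '2024-01-01'
ORDER BY o_orderdate;                    -- the order is part of the declared result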

SQL declares in FROM the role of the table in the query.

The query example in Google's paper starts as:

FROM customer
|>        

This suggests that the "customer" table is this query's primary data source. However, the query concerns orders and counts them per customer only to calculate some statistics. The table "customer" is joined to get a zero count for customers without orders. With the proper indexes, there's a good chance that the query optimizer will decide to start with "orders" and join it with a Right Outer Join to "customer".
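
As an illustration, here is a sketch of the equivalent form the optimizer may choose, driving from "orders". Whether it actually does depends on statistics and indexes:

SELECT c.c_custkey, COUNT(o.o_orderkey)
FROM orders AS o
RIGHT OUTER JOIN customer AS c
   ON c.c_custkey = o.o_custkey
  AND o.o_comment NOT LIKE '%unusual%packages%'
GROUP BY c.c_custkey;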

What is worse is that this clause names a table without an alias and puts it in front, as if this table will be referenced in the rest of the query. It is not. An implicit alias is created and referenced later in the WHERE and SELECT clauses. Not using aliases is a bad practice.

A query can read from "customer" to get the customer who created the order. However, it can also read it to get the customer to whom the order will be delivered. They may not be the same: one customer can order products to be delivered to another one. My query can show both, for example, listing the customers who ordered for another one within the same city. So the FROM clause will JOIN the "city" table but, again, must define the roles:

SELECT *
FROM customer AS "ordering_customer"
JOIN city AS "ordering_customer_city" ON (...)
JOIN customer AS "delivering_customer" ON (...)
JOIN city AS "delivering_customer_city" ON (...)
WHERE
    "ordering_customer".id
     != "delivering_customer".id
AND
    "ordering_customer_city".name
      = "delivering_customer_city".name

The pipe syntax suggests that we navigate from one table to the next. Unfortunately, none of the examples in Google's paper use aliases for the tables they read.

Only simple queries where each table has a single role can be written without aliases, and that doesn't help readability or evolution. Imagine you start writing a query joining an airline, a flight, and an airport. It is better to document immediately that you joined the operating airline and the destination airport rather than leaving the table names as implicit aliases. One day, you will add the originating airport and have to put aliases everywhere.
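
Here is a hedged sketch with a hypothetical schema (the flight, airline, and airport tables and their foreign keys are invented for illustration). Naming the roles up front makes adding the originating airport later a local change, not a rewrite:

SELECT f.flight_number
FROM flight AS f
JOIN airline AS "operating_airline"
  ON "operating_airline".id = f.operating_airline_id
JOIN airport AS "destination_airport"
  ON "destination_airport".id = f.destination_airport_id;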

SQL doesn't read the rows of the table declared in the FROM clause.

If I tell you that my "customer" table has 1000 columns, what do you think of the performance of the following query using the pipe syntax?

FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey        

The pipe syntax sounds familiar because it is widely used in Linux with files. It reads the file on the left and processes its lines while reading.

The FROM clause in SQL is different. It lists logical structures, such as tables, views, functions, subqueries, or CTEs, and declares their role in the SQL query. There's also a good chance that the table you mention in the FROM clause is not read at all, because an index stores the needed data.

Let's discuss the semantic evaluation order mentioned in the paper. I hope the most selective part of the WHERE clause is evaluated before the FROM clause, so that the database reads a range from an index rather than scanning the table from beginning to end and filtering. Additionally, the index scan can evaluate the ORDER BY right after the WHERE condition, getting the index entries in the expected order and avoiding a sort. For pagination or Top-N queries, you don't want the WHERE and ORDER BY evaluated after the FROM and JOIN. When defining your index columns, you can only build the best indexes for your queries if you consider this evaluation order. Most indexing failures occur because of incorrect assumptions about the SQL clause evaluation order.
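
For example, here is a sketch with a hypothetical index on the TPC-H "orders" table. The WHERE condition and the ORDER BY are both resolved within the index scan, so a Top-N query can stop after ten index entries instead of sorting all of a customer's orders:

CREATE INDEX orders_custkey_date ON orders (o_custkey, o_orderdate DESC);

SELECT o_orderkey, o_orderdate
FROM orders
WHERE o_custkey = 42       -- the most selective condition drives the index scan
ORDER BY o_orderdate DESC  -- already satisfied by the index order
LIMIT 10;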

Many users code SELECT * or list all columns in the SELECT clause because they think the performance is the same on a database that stores table rows. They will not benefit from covering indexes. The example query in Google's paper reads the "customer" table only to count the "c_custkey" column: you don't need to read the whole row. The SELECT clause determines what is read, not the FROM clause. The FROM clause only maps the aliases used in the other clauses to the objects known by the database catalog. That's another reason to put FROM after SELECT.
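
As a sketch of the difference, using the customer_c_custkey_idx index that appears in the execution plan below:

-- May be served by an Index Only Scan: the 1000-column rows are never read
SELECT COUNT(c_custkey) FROM customer;

-- Must fetch every column of every row from the table
SELECT * FROM customer;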

If the user starts her query like the one proposed with the pipe syntax, she will think that the physical "customer" table is read and that its rows are piped to a join operator. This is not how it works. Only the execution plan determines where the query starts, which objects it reads, and the data flow between operations. SQL syntax must not include directional operators like |>, to avoid this confusion.

There's no defined data flow before the query planner parses SQL.

This pipe syntax originates from NoSQL databases, where there is no optimizer. SQL databases provide data independence so that the application and the database implementation can evolve independently. The SQL declaration is parsed, and the query planner interprets it with the catalog to build the execution plan.

Here is an example of an execution plan:

                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort
   Sort Key: (count(*)) DESC, c_orders.c_count DESC
   ->  GroupAggregate
         Group Key: c_orders.c_count
         ->  Sort
               Sort Key: c_orders.c_count DESC
               ->  Subquery Scan on c_orders
                     ->  GroupAggregate
                           Group Key: customer.c_custkey
                           ->  YB Batched Nested Loop Left Join
                                 Join Filter: (customer.c_custkey = orders.o_custkey)
                                 Sort Keys: customer.c_custkey
                                 ->  Index Only Scan using customer_c_custkey_idx on customer
                                 ->  Index Scan using orders_o_custkey_o_comment_idx on orders
                                       Index Cond: (o_custkey = ANY (ARRAY[customer.c_custkey, $1, $2, ..., $1023]))
                                       Filter: (o_comment !~~ '%unusual%packages%'::text)
        

I would be delighted if someone proposed a standard way to display an execution plan, and pipes may have their place there, like this, for example:

Index Only Scan using customer_c_custkey_idx on customer
|> YB Batched Nested Loop Left Join
              Index Scan using orders_o_custkey_o_comment_idx on orders
              Index Cond: (o_custkey = ANY (ARRAY[customer.c_custkey, $1, $2, ..., $1023]))
              |> Filter: (o_comment !~~ '%unusual%packages%'::text)
     Join Filter: (customer.c_custkey = orders.o_custkey)
     Sort Keys: customer.c_custkey
|> GroupAggregate
     Group Key: customer.c_custkey
|> Subquery Scan on c_orders
|> Sort
     Sort Key: c_orders.c_count DESC
|> GroupAggregate
     Group Key: c_orders.c_count
|> Sort
     Sort Key: (count(*)) DESC, c_orders.c_count DESC        

This exercise already shows the limits of the pipe syntax, even when the data flow is entirely determined. Contrary to pipes, SQL operations can have multiple input branches. You need either parentheses (like SQL subqueries), named table expressions (the WITH clauses), or indentation (like the execution plan). I hope nobody will try to avoid this by using an equivalent of the HP-35 Reverse Polish Notation. How would you describe a bushy join with pipes?

User-friendly query builders start with SELECT.

The best query builder I've seen was in 1995, when I discovered BusinessObjects v3.0, the first BI tool (it was not yet called BI but Infocenter). The beauty of it was that the data warehouse designer created the database tables and indexes and mapped them to a 'universe' of business objects. The end user picked from those business objects. The list of selected objects served as the SELECT clause, with aliases, expressions, aggregates, and a name. That's where the end user started their query: with the SELECT clause.

The tool generated all other parts of the SQL query because the BusinessObjects universe mapped the JOIN between table aliases or conditional WHERE clauses behind the objects. The objects had a role in the query context, like in my example above, with the airport of origin and the airport of destination, similar to SQL aliases.

In SQL, you declare the desired result, starting with the SELECT clause, and then define the aliases you have used in the other clauses. These are declarative definitions, not procedural code that describes a data flow. Databases let you influence the execution with optimizer hints, which go in special comments enclosed in /*+ */ because they cannot be part of the declarative language.
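
For example, with Oracle-style hint syntax (the index name here is hypothetical), the hint rides inside a comment precisely because it is not part of the declaration:

SELECT /*+ INDEX(o orders_o_custkey_idx) */
       o.o_orderkey
FROM orders o
WHERE o.o_custkey = 42;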

I'm referring to SQL designed for relational databases, as opposed to the SQL-like languages created by NoSQL databases to resemble them. Such a SQL-like API needs to begin with the tables or indexes to be queried because there is no catalog providing logical/physical data independence. Additionally, it must describe the data flow because there's no optimizer to determine the most efficient access path to the data. The pipe syntax makes sense for this pseudo-SQL API.


The pipe syntax described in Google's paper attempts to adapt the SQL language for NoSQL databases. Although it may seem appealing due to its simplicity, it resembles the attraction of NoSQL fifteen years ago: it looks simple to start with but is very limited compared to SQL.

If you find SQL to be challenging, here are some tips to help you:

  • SQL operates on sets of rows. The table specified in the FROM clause is called a 'table' because it has a tabular format consisting of columns and rows. You cannot guess the physical access without looking at the execution plan. It is essential to SELECT only what you need rather than reading a table and fetching all the columns. Starting with the SELECT clause makes it straightforward.
  • A set of rows in SQL has no inherent order, except when presented to the application according to the final ORDER BY. While the database system may know the order during execution and avoid a sorting operation, this is visible in the execution plan, not in the SQL query. It is essential to add an ORDER BY when you expect a sorted result and not rely on an apparent order. Each intermediate row set is defined as a subquery between parentheses, without any data flow direction.
  • The table referenced in the SQL query is not directly read as defined in the FROM clause. The structures referenced in the SQL query are logical, and it's up to the database to determine which columns and rows need to be read, whether from a table, an index, or a materialized view. When multiple tables are listed in the FROM clause, there's no predefined execution order.
  • The table name in the FROM clause should not be used elsewhere. Always add an alias that qualifies its role in your query, and prefix all column names with that alias. Try to start writing your SQL query with the SELECT clause, with the columns you need, aliased according to their role in the logical data model. Then, map the required aliases to tables in the FROM clause. If you prefer to define those aliases first, use a WITH clause.
  • When writing SQL queries, aim for readability and adaptability. Utilize Common Table Expressions (CTEs) for modularity and ensure that the SELECT clause matches the application's intended structure regarding columns and datatypes. Additionally, query only the necessary columns, as this can optimize performance by allowing covering indexes to serve the query more efficiently.

Let's take an example. Here is the functional description of the query given in the paper's example: for each number of orders, provide the number of customers who placed that many orders, counting all orders except those with 'unusual packages' comments.

Here is how I write this using a top-down approach. I start with the result I want, the SELECT with aliases, then detail the definitions of those aliases down to the names of objects stored in my database.

SELECT
      "orders per customer"."number of orders"
      AS "number of orders per customer"
    , COUNT(*)
      AS "number of customers"
FROM
    (
      SELECT
            "customer".c_custkey
            AS "customer"
          , COUNT("customer order".o_orderkey)
            AS "number of orders"
      FROM customer
        AS "customer"
      LEFT OUTER JOIN orders
        AS "customer order"
        ON "customer".c_custkey
             = "customer order".o_custkey
       AND "customer order".o_comment
             NOT LIKE '%unusual%packages%'
      GROUP BY "customer".c_custkey
    )
  AS "orders per customer"
GROUP BY "number of orders"
ORDER BY
    "number of customers" DESC
  , "number of orders per customer" DESC
;

When I want to start with the FROM clause, here is how I write it using a bottom-up approach. I start with my data structures, qualify their role in my query, and transform them into the result I want:

WITH
"customer" AS (
    SELECT c_custkey AS "customer"
    FROM customer
),
"order" AS (
    SELECT
        o_custkey  AS "customer"
      , o_orderkey AS "order"
    FROM orders
    WHERE o_comment
          NOT LIKE '%unusual%packages%'
),
"customer orders" AS (
    SELECT
        "customer"
      , COUNT("order") AS "number of orders"
    FROM "customer"
    -- the left outer join keeps customers with zero orders (counted as 0)
    NATURAL LEFT OUTER JOIN "order"
    GROUP BY
        "customer"
),
"number of orders per customer" AS (
    SELECT
        "number of orders"
      , COUNT(*) AS "number of customers"
    FROM
        "customer orders"
    GROUP BY
        "number of orders"
)
SELECT
    "number of orders"
  , "number of customers"
FROM
    "number of orders per customer"
ORDER BY
    "number of customers" DESC
  , "number of orders" DESC

Please feel free to use your own style rather than mine. I prefer quoted aliases for readability. I use natural joins when I fully control the column names (projecting with aliases before joining). I also like to name each intermediate result, either with an alias when it's a subquery or with the CTE name in the WITH clause. Additionally, I leave room for comments when the naming conventions are not fully descriptive.

If you have access to the pipe syntax and it makes your life easier, feel free to use it. However, remember that the direction suggested by the |> syntax is misleading, and starting your query with a FROM clause does not define how it will be executed. Make sure your colleagues know this too, since they will have to read your queries.

Finally, given that this is a controversial subject, let me add one more subjective comment: the pipe syntax is to SQL what Venn diagrams are to join illustrations. They look like a good idea for quickly understanding the basics, but they can make comprehension more complex, as their representations contradict the concepts they aim to depict.

  • Wrt. aliases, they support them both as a suffix on JOIN etc. and as an `|> AS alias` step (https://github.com/google/zetasql/blob/master/docs/pipe-syntax.md#as-pipe-operator). So I don't see that as a weakness of the syntax, just something they didn't highlight in the article's examples?
  • The article never suggested pipes determine implementation order, just "semantic" order; users still need to understand the optimizer might change it wildly. You raise an interesting question whether pipe syntax might mislead them. But I suspect the majority (me too) don't understand the implementation well enough to "feel" what's good for performance (beyond sometimes listening when experts give rules of thumb like your advice "SELECT only what you need"). And isn't it true that before one can grow a feel for how actual scans happen, they need to understand the semantic order of SQL anyway? Pipe syntax helps in that part. IIUC you suggest the existing SELECT ... FROM ... WHERE order is actually closer to what happens in practice. But that can't always be true: existing SQL already allows re-ordering with sub-queries vs. CTEs, as in your top-down vs. bottom-up example; pipe syntax is essentially shorthand for CTEs.

Szilard Barany

Sales Engineer at Sayari

6 months ago

Their lack of understanding of SQL (standard and code) does not give them the right to criticise it. SQL, despite being designed by a committee, is a remarkably compact and powerful language that can, by the way, be easily read by humans. I don't have a problem writing (or reading), for example, C++ code or regexps that look like sets of random characters, but I do appreciate the readability of SQL. Pipes and other borrowed syntax (from POSIX, for example) would ruin it without adding any value.

Guillaume Defendini

Data Lead - Data Architect - DBA

6 months ago

very interesting

Mayuresh B.

Database Specialist @ Veeam Software | Databases, RDBMS, Performance Tuning

6 months ago

It will be a hard sell to any existing company utilizing SQL in its code base. Try explaining to management that you want to revamp millions of lines of code just because the developers in Google's survey (if they did one) find piped syntax easy to grasp.
