Learn SQL the Effective Way Part 4 - CTEs and Subqueries
Last time we learnt about the importance of writing clean code with some tools to make it easier for us. Now that we're getting more advanced with our queries, we'll learn about Common Table Expressions (CTEs) and subqueries.?
CTEs and Subqueries allow you to write additional queries and use the results in another query. Sounds confusing? Here's an example.
Say I want to find out which F1 drivers set the top 10 fastest laps at the Monaco Grand Prix in 2016 and 2017, how should we do this???Follow along with these queries on data.world
No CTE / Subquery method
Normally we would write queries for 2016 and 2017, save these as tables and join them together to see the result. In doing so we've had to execute 3 queries and create 2 new tables in our database which we may not have permission to do.
CREATE TABLE mgp_2017 as (
SELECT
DRV.forename || ' ' || DRV.surname as fullname,
MAX(RES.fastestlapspeed :: decimal) as fastest_lap
FROM
results AS RES
INNER JOIN races AS RAC ON RAC.raceid = RES.raceid
INNER JOIN drivers AS DRV ON DRV.driverid = RES.driverid
WHERE
RES.fastestlapspeed :: decimal IS NOT NULL
AND RAC.name = 'Monaco Grand Prix'
AND YEAR(RAC.date) = 2017
GROUP BY
DRV.forename || ' ' || DRV.surname
ORDER BY
MAX(RES.fastestlapspeed :: decimal) DESC
LIMIT
10
);
CREATE TABLE mgp_2016 as (
SELECT
DRV.forename || ' ' || DRV.surname as fullname,
MAX(RES.fastestlapspeed :: decimal) as fastest_lap
FROM
results AS RES
INNER JOIN races AS RAC ON RAC.raceid = RES.raceid
INNER JOIN drivers AS DRV ON DRV.driverid = RES.driverid
WHERE
RES.fastestlapspeed :: decimal IS NOT NULL
AND RAC.name = 'Monaco Grand Prix'
AND YEAR(RAC.date) = 2017
GROUP BY
DRV.forename || ' ' || DRV.surname
ORDER BY
MAX(RES.fastestlapspeed :: decimal) DESC
LIMIT
10
);
SELECT
mgp_2017.fullname,
mgp_2016.fastest_lap as fastest_lap_2016,
mgp_2017.fastest_lap as fastest_lap_2017
FROM
mgp_2017
INNER JOIN mgp_2016 ON mgp_2017.fullname = mgp_2016.fullname;
Here we're using CTEs like the tables earlier but not writing them to the database. In doing so we only execute one query, which can be faster than executing several smaller queries in PostgreSQL.
WITH mgp_2017 AS (
SELECT
DRV.forename || ' ' || DRV.surname as fullname,
MAX(RES.fastestlapspeed :: decimal) as fastest_lap
FROM
results AS RES
INNER JOIN races AS RAC ON RAC.raceid = RES.raceid
INNER JOIN drivers AS DRV ON DRV.driverid = RES.driverid
WHERE
RES.fastestlapspeed :: decimal IS NOT NULL
AND RAC.name = 'Monaco Grand Prix'
AND YEAR(RAC.date) = 2017
GROUP BY
DRV.forename || ' ' || DRV.surname
ORDER BY
MAX(RES.fastestlapspeed :: decimal) DESC
LIMIT
10
)
,mgp_2016 AS (
SELECT
DRV.forename || ' ' || DRV.surname as fullname,
MAX(RES.fastestlapspeed :: decimal) as fastest_lap
FROM
results AS RES
INNER JOIN races AS RAC ON RAC.raceid = RES.raceid
INNER JOIN drivers AS DRV ON DRV.driverid = RES.driverid
WHERE
RES.fastestlapspeed :: decimal IS NOT NULL
AND RAC.name = 'Monaco Grand Prix'
AND YEAR(RAC.date) = 2016
GROUP BY
DRV.forename || ' ' || DRV.surname
ORDER BY
MAX(RES.fastestlapspeed :: decimal) DESC
LIMIT
10
)
SELECT
mgp_2017.fullname,
mgp_2016.fastest_lap as fastest_lap_2016,
mgp_2017.fastest_lap as fastest_lap_2017
FROM
mgp_2017
INNER JOIN mgp_2016 ON mgp_2017.fullname = mgp_2016.fullname
To declare our first CTE I should be of the form:
WITH {CTE Name} AS (
{Query}
)
for additional CTEs the "WITH" is not required and is replaced with a comma, e.g. ", CTE2 AS (..."
With subqueries, we return the same query result as the CTE but noticeably different code structure.
SELECT
mgp_2017.fullname,
mgp_2016.fastest_lap_2016,
mgp_2017.fastest_lap_2017
FROM (
SELECT
DRV.forename || ' ' || DRV.surname as fullname,
MAX(RES.fastestlapspeed :: decimal) as fastest_lap_2017
FROM
results AS RES
INNER JOIN races AS RAC ON RAC.raceid = RES.raceid
INNER JOIN drivers AS DRV ON DRV.driverid = RES.driverid
WHERE
RES.fastestlapspeed :: decimal IS NOT NULL
AND RAC.name = 'Monaco Grand Prix'
AND YEAR(RAC.date) = 2017
GROUP BY
DRV.forename || ' ' || DRV.surname
ORDER BY
MAX(RES.fastestlapspeed :: decimal) DESC
LIMIT
10
) AS mgp_2017
INNER JOIN (
SELECT
DRV.forename || ' ' || DRV.surname as fullname,
MAX(RES.fastestlapspeed :: decimal) as fastest_lap_2016
FROM
results AS RES
INNER JOIN races AS RAC ON RAC.raceid = RES.raceid
INNER JOIN drivers AS DRV ON DRV.driverid = RES.driverid
WHERE
RES.fastestlapspeed :: decimal IS NOT NULL
AND RAC.name = 'Monaco Grand Prix'
AND YEAR(RAC.date) = 2016
GROUP BY
DRV.forename || ' ' || DRV.surname
ORDER BY
MAX(RES.fastestlapspeed :: decimal) DESC
LIMIT
10
) AS mgp_2016 ON mgp_2016.fullname = mgp_2017.fullname
You can see our subqueries take the place of tables we SELECT FROM or JOIN to, i.e.
领英推荐
FROM {Query} AS alias
LEFT|INNER JOIN {Query} AS alias2 ON alias.id = alias2.id
but there's more, one of the advantages of subqueries is that don't just act as tables, you can use a subquery as a column or part of the WHERE or HAVING clauses, provided they return one value.
In the example below we're showing all F1 drivers in the 2017 Monaco GP that beat the fastest lap time in 2016 by using a subquery in the where clause.
SELECT?
? DRV.forename || ' ' || DRV.surname as fullname,?
? MAX(RES.fastestlapspeed :: decimal) as fastest_lap_2017?
FROM?
? results AS RES?
? INNER JOIN races AS RAC ON RAC.raceid = RES.raceid?
? INNER JOIN drivers AS DRV ON DRV.driverid = RES.driverid?
WHERE?
? RES.fastestlapspeed :: decimal IS NOT NULL?
? AND RAC.name = 'Monaco Grand Prix'?
? AND YEAR(RAC.date) = 2017?
? AND RES.fastestlapspeed :: decimal > (
? ? SELECT?
? ? ? MAX(fastestlapspeed :: decimal)?
? ? FROM?
? ? ? results AS RES?
? ? ? INNER JOIN races AS RAC ON RAC.raceid = RES.raceid?
? ? WHERE?
? ? ? YEAR(RAC.date) = 2016?
? ? ? AND RAC.name = 'Monaco Grand Prix'
? )?
GROUP BY?
? DRV.forename || ' ' || DRV.surname?
ORDER BY?
? MAX(RES.fastestlapspeed :: decimal) DESC
Which is better? CTE or Subquery
Both techniques can achieve the same results with comparable query times, in different situations, one may perform better than the other.
I generally prefer CTEs because:
1. Easier to read and debug
If your query isn't returning the correct data it's very easy to check what the CTE is doing
2. More reusable
When we declare and name a CTE we can reuse it in multiple joins, be that in other CTEs or our final SELECT statement.
3. CTEs can be recursive
Recursive CTEs are self-referencing and are very useful for creating a list of values, e.g. 1-10, or dates 2022-01-01 to 2022-12-01. Recursive CTEs are not supported on data.world but running the following query on Snowflake will produce a column of numbers 1-10.
WITH numbers AS (
? SELECT?
? ? 1 as n?
? UNION ALL?
? SELECT?
? ? n + 1 as n?
? FROM?
? ? numbers?
? WHERE?
? ? n + 1 <= 10
)?
SELECT?
? n?
FROM?
? numbers
But I will still use subqueries for: