SQL window functions: Rows, range, unbounded preceding
When it comes to analytics, SQL has many advanced functions. In this article, we will look at window functions, focusing on the OVER clause that accepts the following parameters:
- PARTITION BY clause
- ORDER BY clause
- ROWS or RANGE clause
We will be looking at different variations for ROWS or RANGE clause, including PRECEDING, FOLLOWING, and UNBOUNDED. We will also cover LEAD, LAG, ROW_NUMBER, RANK, and DENSE_RANK.
All queries are executed on Snowflake DB. Sign up for a free 30 days trial account at https://trial.snowflake.com/. All SQL commands in this article are available at the end of it.
Table of Contents
1. Input data
2. Load and clean data in Snowflake
3. Window functions
- LEAD and LAG
- ROW_NUMBER, RANK, and DENSE_RANK
- ROWS PRECEDING and ROWS FOLLOWING
- UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
- ROWS vs. RANGE
1. Input data
We will be using weather data from meteofblue from 01/27 to 02/10 for Bucharest on an hourly basis, focusing on the temperature parameter only.
2. Load and clean data in Snowflake
1. Let's take a quick look at the data.
We see we have some metadata rows at the beginning of the file that we will have to skip when loading (12 rows), and the column structure is: [year, month, day, hour, minute, temperature].
2. Create a database, schema, and table
# Create database, schema, table, file format CREATE DATABASE IF NOT EXISTS WEATHER_DATA; USE DATABASE WEATHER_DATA; CREATE SCHEMA IF NOT EXISTS ROMANIA; USE SCHEMA ROMANIA; ## Table will contain only one column of datatype VARIANT that stores the JSON file CREATE OR REPLACE TABLE BUCHAREST ( year varchar(4), month integer, day integer, hour integer, minute integer, temperature float );
2. Load data using Snowflake Web UI
In the file format, we will specify the number of rows to skip and the delimiter (;).
If the file is loaded successfully, the table should contain 360 rows.
By looking at the data, we see the temperature column contains decimals. For our purpose, it will be easier to follow along if we round the temperature values.
UPDATE BUCHAREST SET temperature = round(temperature); SELECT * FROM BUCHAREST;
Great. Now we see MINUTE column only contains 0s, and we would want to roll up to Day level, therefore let's start grouping the data and wrap it into a view:
CREATE VIEW bucharest_temperature_daylevel as SELECT YEAR,MONTH,DAY,ROUND(AVG(TEMPERATURE)) as TEMPERATURE FROM BUCHAREST GROUP BY YEAR,MONTH,DAY;
This view will be our baseline for all queries.
3. Window functions
To exemplify the window functions, let's ask some questions about our dataset.
Q: How was the temperature on the previous day?
There is a particular function, called lag, that does precisely that, so let's jump right into it and see how to use it.
LAG AND LEAD functions
LAG - "Accesses data in a previous row in the same result set without having to join the table to itself." - https://docs.snowflake.net/manuals/sql-reference/functions/lag.html
SELECT year,month,day, temperature, lag(temperature) over (order by year,month,day) as prev_temperature FROM bucharest_temperature_daylevel;
We use the `order by` clause to specify how to sort the input before applying the lag function. LAG can accept a second parameter, which determines how many steps to go back. In our case, we only wanted to go one step back (previous day), therefore using the second parameter (default = 1) was not required.
The opposite of LAG is LEAD. While LAG looks at previous values, LEAD looks forward.
Both LEAD and LAG also accept the PARTITION BY clause in the OVER condition. We will cover the PARTITION BY clause in the following examples, but feel free to come back to this example and PARTITION your data by year and month to see how it behaves.
For now, keep in mind, you can also specify an empty set of columns for partition, and it will behave in the same manner to an OVER condition without a PARTITION BY clause. The following two statements are identical:
SELECT year,month,day, temperature, lag(temperature) over (order by year,month,day) as prev_temperature FROM bucharest_temperature_daylevel; SELECT year,month,day, temperature, lag(temperature) over (partition by '' order by year,month,day) as prev_temperature FROM bucharest_temperature_daylevel;
Q: How can you remove the last day of each month?
To accomplish this, we need to:
- identify the last day of each month based on the dataset values
- remove those lines.
To identify the last day, we need to partition/slice the data at the month level and sort the results in a manner that the last day of the partition is first. Then using a function, we would want to assign each row, a value in the order they appear in the partition. That function is called ROW_NUMBER.
ROW_NUMBER function
"Returns a unique row number for each row within a window partition" - https://docs.snowflake.net/manuals/sql-reference/functions/row_number.html
SELECT year,month,day, temperature, row_number() over (partition by year, month order by year,month,day desc) as last_day_of_month_from_data FROM bucharest_temperature_daylevel ORDER BY 1,2,3;
Two things to notice:
- PARTITION BY - specifies how do we want to chunk our dataset into partitions - in our case, we want to partition at the year-month level since we are interested in finding the last value for each year-month frame.
- ORDER BY - specifies in what sequence the values within those partitions should be fetched - in our case, we want rows sorted in descending way by `day`.
RANK and DENSE_RANK functions
RANK and DENSE_RANK are similar to ROW_NUMBER. The only difference is how they treat duplicates. Let's put them side by side and see the outcome.
Instead of partitioning at the year-month level, we will partition at the year level. The reason for this is that at the year-month level, all days have different values (each day has a unique value for a given month), whereas, at the year level, we have multiple rows with the same month (5 rows corresponding to January and 10 to rows to February).
SELECT year,month, row_number() over (partition by year order by month) as rn, rank() over (partition by year order by month) as rank, dense_rank() over (partition by year order by month) as dense_rank FROM bucharest_temperature_daylevel ORDER BY 1,2;
ROW_NUMBER outputs an incremental value for each row in a partition (in our case at the year level).
On the other side, RANK, and DENSE_RANK stop incrementing the value when two rows have matching value (same month). Both functions assign the same value for matching rows, but the difference lies in what value they assign for the next non-matching value.
DENSE_RANK function assigns the next value in the iterator, while RANK keeps an iterator behind the scene for matching values and leverages it when finding the next value. RANK: 1,1,1,1,1 (5 rows), next value is 6.
DENSE_RANK: 1,1,1,1,1 (5 rows), next value is 2.
Q: What was the coldest 4-day period?
So far, we looked at examples that operate at the partition level, without having any control over what rows from that partition to select for computation.
In our case, we don't even have to specify the partition, as we are also interested in computing the value for Jan 30th - Feb 2nd window. If we define the year-month partition, we will split the dataset into two frames (January and February).
To specify how many rows to capture as part of the computing, we will use the ROWS clause.
ROWS PRECEDING and ROWS FOLLOWING
Preceding refers to rows before the current row, where the following argument refers to rows after the current row. We can specify a fixed value for preceding or following, or as we will see later on, we can determine the limits of the partition with the UNBOUNDED.
SELECT year,month,day,temperature, avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN 3 preceding and current row) last_3_days_avg FROM bucharest_temperature_daylevel ORDER BY 1,2,3;
Let's pause for a second and understand what happened:
- partition by '' - means there is no partition - as stated above, we are not interested in partitioning the data for this exercise
- order by year, month, day - we want the values sorted chronologically
- ROWS BETWEEN 3 preceding and current row - looks at three rows above the current row and current row
- AVG(temperature) - with the four rows selected (specified through the ROWS clause above), it computes the average.
Every value in the new column will contain the average of the past three days and the current day.
For example: Feb 3rd - Average: 9°C = (Jan 31st value (5°C) + Feb 1st value (9°C) + Feb 2nd value (10°C) + Feb 3rd value (12°C))/4 = 36/4°C
The coldest 4-day period was Feb 6th to Feb 9th, with an average of 1°C. The value is recorded on the last day of the interval, Feb 9th - the upper limit of the interval specified by our formula).
The same result can be achieved if we specify the limits of the window differently.
SELECT year,month,day,temperature, avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN 2 preceding and 1 following) last_3_days_avg FROM bucharest_temperature_daylevel ORDER BY 1,2,3;
In this case, we look at the two rows before our current row, current row, and one row following the current row. We get the same minimum value for the four days on Feb 8th line, composed of values spanning from Feb 6th to Feb 9th.
Q: What is the cumulative average temperature?
We looked at ROWS PRECEDING and ROWS FOLLOWING clauses, which lets us control the window size (we can visualize it as a frame that slides through our partition). But what if we want to cumulate the values? The frame size increases as we move through the partition: for the 3rd row, we will look at the previous two rows; for the 10th row, we will look at the last nine rows.
UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
To achieve this, we need to fix one of the ends of the frame and have the other one slide. To do this, SQL allows for the UNBOUNDED keyword to refer to one of the edges.
These windows are called cumulative frames, and their purpose is to continuously built on the previous result set.
In our example, let's average all rows until the previous row, by specifying the two limits as:
(unbounded preceding, 1 preceding)
SELECT year,month,day,temperature, avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN unbounded preceding and 1 preceding) preceding_days_avg FROM bucharest_temperature_daylevel ORDER BY 1,2,3;
Let us validate the results: 6th row (Feb 1st) - (6+8+7+5+5) / 5 = 31/5 = 6.2\
Q: What is the monthly average, while keeping the same granularity of the dataset?
We can compute the average grouping by year and month, but then we lose the dataset granularity, as we roll up results at the year-month level. To resolve this, we would have to join back with our original dataset - but this solution is expensive if you are dealing with a high volume of data.
ROWS and RANGES
Lucky for us, in SQL, we can leverage a variation of ROWS, namely the RANGE clause. When using the ROWS clause, duplicates are treated the same as unique values: the iterator moves through the ordered rows of the partition and computes the function specified.
RANGE is different: if RANGE encounters duplicate values (based on the order by clause), it computes the function upfront and yields the same result for all duplicate rows.
In our case, we want to partition by year-month - this is a repeatable tuple for January (5 rows) and February (10 rows). Given this partition, when we use the clause:
- RANGE, it will compute the function (average) upfront for all rows
- ROWS, it will calculate the function (average) as it slides through the dataset.
We are interested in computing all rows up until the current row, so we will use BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
Let's run RANGE and ROWS side by side to compare results.
SELECT year,month,day,temperature, avg(temperature) over (partition by year, month order by year,month ROWS BETWEEN unbounded preceding and current row) rows_avg, avg(temperature) over (partition by year, month order by year,month RANGE BETWEEN unbounded preceding and current row) range_avg FROM bucharest_temperature_daylevel ORDER BY 1,2,3;
Conclusion
That concludes our tutorial on window functions. Unlike scalar and simple aggregate functions, window functions are not easily replicated in other data manipulation languages, such as Python and require a lot more coding and processing power to match what SQL offers.
There are many other window functions aren't covered in this article, like nth_value, cume_dist, ntile, and others. You can go over all of them here: https://docs.snowflake.net/manuals/sql-reference/functions-analytic.html
Full SQL code
# Create database, schema, table, file format CREATE DATABASE IF NOT EXISTS WEATHER_DATA; USE DATABASE WEATHER_DATA; CREATE SCHEMA IF NOT EXISTS ROMANIA; USE SCHEMA ROMANIA; ## Table will contain only one column of datatype VARIANT that stores the JSON file CREATE OR REPLACE TABLE BUCHAREST ( year varchar(4), month integer, day integer, hour integer, minute integer, temperature float ); SELECT * FROM BUCHAREST; UPDATE BUCHAREST SET temperature = round(temperature); SELECT * FROM BUCHAREST; CREATE VIEW bucharest_temperature_daylevel as SELECT YEAR,MONTH,DAY,ROUND(AVG(TEMPERATURE)) as TEMPERATURE FROM BUCHAREST GROUP BY YEAR,MONTH,DAY; SELECT * FROM bucharest_temperature_daylevel; SELECT year,month,day, temperature, lag(temperature) over (order by year,month,day) as prev_temperature FROM bucharest_temperature_daylevel; SELECT year,month,day, temperature, lag(temperature) over (partition by '' order by year,month,day) as prev_temperature FROM bucharest_temperature_daylevel; SELECT year,month,day, temperature, row_number() over (partition by year, month order by year,month,day desc) as last_day_of_month_from_data FROM bucharest_temperature_daylevel ORDER BY 1,2,3; SELECT year,month, row_number() over (partition by year order by month) as rn, rank() over (partition by year order by month) as rank, dense_rank() over (partition by year order by month) as dense_rank FROM bucharest_temperature_daylevel ORDER BY 1,2; SELECT year,month,day,temperature, avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN 3 preceding and current row) last_3_days_avg FROM bucharest_temperature_daylevel ORDER BY 1,2,3; SELECT year,month,day,temperature, avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN 2 preceding and 1 following) last_3_days_avg FROM bucharest_temperature_daylevel ORDER BY 1,2,3; SELECT year,month,day,temperature, avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN unbounded preceding and 1 preceding) preceding_days_avg FROM bucharest_temperature_daylevel ORDER BY 1,2,3; SELECT year,month,avg(temperature) as temp FROM bucharest_temperature_daylevel GROUP BY 1,2; SELECT year,month,day,temperature, avg(temperature) over (partition by year, month order by year,month ROWS BETWEEN unbounded preceding and current row) rows_avg, avg(temperature) over (partition by year, month order by year,month RANGE BETWEEN unbounded preceding and current row) range_avg FROM bucharest_temperature_daylevel ORDER BY 1,2,3;
Data Analyst
1 年In ROWS can we use rows between specified value and another specified value if not how can i achieve it ?
Big Data Engineer | Analytics
4 年Window functions are awesome! I just learned it these days. You could cover regex as well, studying it right now.