登录查看更多内容

SQL window functions: Rows, range, unbounded preceding

Octavian Zarzu

#dataengineering ??

发布日期: 2020年3月16日

When it comes to analytics, SQL has many advanced functions. In this article, we will look at window functions, focusing on the OVER clause that accepts the following parameters:

PARTITION BY clause
ORDER BY clause
ROWS or RANGE clause

We will be looking at different variations for ROWS or RANGE clause, including PRECEDING, FOLLOWING, and UNBOUNDED. We will also cover LEAD, LAG, ROW_NUMBER, RANK, and DENSE_RANK.

All queries are executed on Snowflake DB. Sign up for a free 30 days trial account at https://trial.snowflake.com/. All SQL commands in this article are available at the end of it.

1. Input data

2. Load and clean data in Snowflake

3. Window functions

LEAD and LAG
ROW_NUMBER, RANK, and DENSE_RANK
ROWS PRECEDING and ROWS FOLLOWING
UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
ROWS vs. RANGE

1. Input data

We will be using weather data from meteofblue from 01/27 to 02/10 for Bucharest on an hourly basis, focusing on the temperature parameter only.

2. Load and clean data in Snowflake

1. Let's take a quick look at the data.

We see we have some metadata rows at the beginning of the file that we will have to skip when loading (12 rows), and the column structure is: [year, month, day, hour, minute, temperature].

2. Create a database, schema, and table

# Create database, schema, table, file format 
CREATE DATABASE IF NOT EXISTS WEATHER_DATA;
USE DATABASE WEATHER_DATA;


CREATE SCHEMA IF NOT EXISTS ROMANIA;
USE SCHEMA ROMANIA;


## Table will contain only one column of datatype VARIANT that stores the JSON file
CREATE OR REPLACE TABLE BUCHAREST
(
  year varchar(4),
  month integer,
  day integer,
  hour integer,
  minute integer,
  temperature float
);

2. Load data using Snowflake Web UI

In the file format, we will specify the number of rows to skip and the delimiter (;).

If the file is loaded successfully, the table should contain 360 rows.

By looking at the data, we see the temperature column contains decimals. For our purpose, it will be easier to follow along if we round the temperature values.

UPDATE BUCHAREST
SET temperature = round(temperature);

SELECT *
FROM BUCHAREST;

Great. Now we see MINUTE column only contains 0s, and we would want to roll up to Day level, therefore let's start grouping the data and wrap it into a view:

CREATE VIEW bucharest_temperature_daylevel as
SELECT YEAR,MONTH,DAY,ROUND(AVG(TEMPERATURE)) as TEMPERATURE
FROM BUCHAREST
GROUP BY YEAR,MONTH,DAY;

This view will be our baseline for all queries.

3. Window functions

To exemplify the window functions, let's ask some questions about our dataset.

Q: How was the temperature on the previous day?

There is a particular function, called lag, that does precisely that, so let's jump right into it and see how to use it.

LAG AND LEAD functions

LAG - "Accesses data in a previous row in the same result set without having to join the table to itself." - https://docs.snowflake.net/manuals/sql-reference/functions/lag.html

SELECT year,month,day, temperature, lag(temperature) over (order by year,month,day) as prev_temperature
FROM bucharest_temperature_daylevel;

We use the `order by` clause to specify how to sort the input before applying the lag function. LAG can accept a second parameter, which determines how many steps to go back. In our case, we only wanted to go one step back (previous day), therefore using the second parameter (default = 1) was not required.

The opposite of LAG is LEAD. While LAG looks at previous values, LEAD looks forward.

Both LEAD and LAG also accept the PARTITION BY clause in the OVER condition. We will cover the PARTITION BY clause in the following examples, but feel free to come back to this example and PARTITION your data by year and month to see how it behaves.

For now, keep in mind, you can also specify an empty set of columns for partition, and it will behave in the same manner to an OVER condition without a PARTITION BY clause. The following two statements are identical:

SELECT year,month,day, temperature, lag(temperature) over (order by year,month,day) as prev_temperature
FROM bucharest_temperature_daylevel;



SELECT year,month,day, temperature, lag(temperature) over (partition by '' order by year,month,day) as prev_temperature
FROM bucharest_temperature_daylevel;

Q: How can you remove the last day of each month?

To accomplish this, we need to:

identify the last day of each month based on the dataset values
remove those lines.

To identify the last day, we need to partition/slice the data at the month level and sort the results in a manner that the last day of the partition is first. Then using a function, we would want to assign each row, a value in the order they appear in the partition. That function is called ROW_NUMBER.

ROW_NUMBER function

"Returns a unique row number for each row within a window partition" - https://docs.snowflake.net/manuals/sql-reference/functions/row_number.html

SELECT year,month,day, temperature, row_number() over (partition by year, month order by year,month,day desc) as last_day_of_month_from_data
FROM bucharest_temperature_daylevel
ORDER BY 1,2,3;

Two things to notice:

PARTITION BY - specifies how do we want to chunk our dataset into partitions - in our case, we want to partition at the year-month level since we are interested in finding the last value for each year-month frame.
ORDER BY - specifies in what sequence the values within those partitions should be fetched - in our case, we want rows sorted in descending way by `day`.

RANK and DENSE_RANK functions

RANK and DENSE_RANK are similar to ROW_NUMBER. The only difference is how they treat duplicates. Let's put them side by side and see the outcome.

Instead of partitioning at the year-month level, we will partition at the year level. The reason for this is that at the year-month level, all days have different values (each day has a unique value for a given month), whereas, at the year level, we have multiple rows with the same month (5 rows corresponding to January and 10 to rows to February).

SELECT year,month, 
row_number() over (partition by year order by month) as rn,
rank() over (partition by year order by month) as rank,
dense_rank() over (partition by year order by month) as dense_rank
FROM bucharest_temperature_daylevel
ORDER BY 1,2;

ROW_NUMBER outputs an incremental value for each row in a partition (in our case at the year level).

On the other side, RANK, and DENSE_RANK stop incrementing the value when two rows have matching value (same month). Both functions assign the same value for matching rows, but the difference lies in what value they assign for the next non-matching value.

DENSE_RANK function assigns the next value in the iterator, while RANK keeps an iterator behind the scene for matching values and leverages it when finding the next value. RANK: 1,1,1,1,1 (5 rows), next value is 6.

DENSE_RANK: 1,1,1,1,1 (5 rows), next value is 2.

Q: What was the coldest 4-day period?

So far, we looked at examples that operate at the partition level, without having any control over what rows from that partition to select for computation.

In our case, we don't even have to specify the partition, as we are also interested in computing the value for Jan 30th - Feb 2nd window. If we define the year-month partition, we will split the dataset into two frames (January and February).

To specify how many rows to capture as part of the computing, we will use the ROWS clause.

ROWS PRECEDING and ROWS FOLLOWING

Preceding refers to rows before the current row, where the following argument refers to rows after the current row. We can specify a fixed value for preceding or following, or as we will see later on, we can determine the limits of the partition with the UNBOUNDED.

SELECT year,month,day,temperature, 
avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN 3 preceding and current row) last_3_days_avg
FROM bucharest_temperature_daylevel
ORDER BY 1,2,3;

Let's pause for a second and understand what happened:

partition by '' - means there is no partition - as stated above, we are not interested in partitioning the data for this exercise
order by year, month, day - we want the values sorted chronologically
ROWS BETWEEN 3 preceding and current row - looks at three rows above the current row and current row
AVG(temperature) - with the four rows selected (specified through the ROWS clause above), it computes the average.

Every value in the new column will contain the average of the past three days and the current day.

For example: Feb 3rd - Average: 9°C = (Jan 31st value (5°C) + Feb 1st value (9°C) + Feb 2nd value (10°C) + Feb 3rd value (12°C))/4 = 36/4°C

The coldest 4-day period was Feb 6th to Feb 9th, with an average of 1°C. The value is recorded on the last day of the interval, Feb 9th - the upper limit of the interval specified by our formula).

The same result can be achieved if we specify the limits of the window differently.

SELECT year,month,day,temperature, 
avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN 2 preceding and 1 following) last_3_days_avg
FROM bucharest_temperature_daylevel
ORDER BY 1,2,3;

In this case, we look at the two rows before our current row, current row, and one row following the current row. We get the same minimum value for the four days on Feb 8th line, composed of values spanning from Feb 6th to Feb 9th.

Q: What is the cumulative average temperature?

We looked at ROWS PRECEDING and ROWS FOLLOWING clauses, which lets us control the window size (we can visualize it as a frame that slides through our partition). But what if we want to cumulate the values? The frame size increases as we move through the partition: for the 3rd row, we will look at the previous two rows; for the 10th row, we will look at the last nine rows.

UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING

To achieve this, we need to fix one of the ends of the frame and have the other one slide. To do this, SQL allows for the UNBOUNDED keyword to refer to one of the edges.

These windows are called cumulative frames, and their purpose is to continuously built on the previous result set.

In our example, let's average all rows until the previous row, by specifying the two limits as:

(unbounded preceding, 1 preceding)

SELECT year,month,day,temperature, 
avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN unbounded preceding and 1 preceding) preceding_days_avg
FROM bucharest_temperature_daylevel
ORDER BY 1,2,3;

Let us validate the results: 6th row (Feb 1st) - (6+8+7+5+5) / 5 = 31/5 = 6.2\

Q: What is the monthly average, while keeping the same granularity of the dataset?

We can compute the average grouping by year and month, but then we lose the dataset granularity, as we roll up results at the year-month level. To resolve this, we would have to join back with our original dataset - but this solution is expensive if you are dealing with a high volume of data.

ROWS and RANGES

Lucky for us, in SQL, we can leverage a variation of ROWS, namely the RANGE clause. When using the ROWS clause, duplicates are treated the same as unique values: the iterator moves through the ordered rows of the partition and computes the function specified.

RANGE is different: if RANGE encounters duplicate values (based on the order by clause), it computes the function upfront and yields the same result for all duplicate rows.

In our case, we want to partition by year-month - this is a repeatable tuple for January (5 rows) and February (10 rows). Given this partition, when we use the clause:

RANGE, it will compute the function (average) upfront for all rows
ROWS, it will calculate the function (average) as it slides through the dataset.

We are interested in computing all rows up until the current row, so we will use BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Let's run RANGE and ROWS side by side to compare results.

SELECT year,month,day,temperature,
avg(temperature) over (partition by year, month order by year,month ROWS BETWEEN unbounded preceding and current row) rows_avg,
avg(temperature) over (partition by year, month order by year,month RANGE BETWEEN unbounded preceding and current row) range_avg
FROM bucharest_temperature_daylevel
ORDER BY 1,2,3;

Conclusion

That concludes our tutorial on window functions. Unlike scalar and simple aggregate functions, window functions are not easily replicated in other data manipulation languages, such as Python and require a lot more coding and processing power to match what SQL offers.

There are many other window functions aren't covered in this article, like nth_value, cume_dist, ntile, and others. You can go over all of them here: https://docs.snowflake.net/manuals/sql-reference/functions-analytic.html

Full SQL code

# Create database, schema, table, file format 
CREATE DATABASE IF NOT EXISTS WEATHER_DATA;
USE DATABASE WEATHER_DATA;


CREATE SCHEMA IF NOT EXISTS ROMANIA;
USE SCHEMA ROMANIA;


## Table will contain only one column of datatype VARIANT that stores the JSON file
CREATE OR REPLACE TABLE BUCHAREST
(
  year varchar(4),
  month integer,
  day integer,
  hour integer,
  minute integer,
  temperature float
);




SELECT *
FROM BUCHAREST;


UPDATE BUCHAREST
SET temperature = round(temperature);


SELECT *
FROM BUCHAREST;




CREATE VIEW bucharest_temperature_daylevel as
SELECT YEAR,MONTH,DAY,ROUND(AVG(TEMPERATURE)) as TEMPERATURE
FROM BUCHAREST
GROUP BY YEAR,MONTH,DAY;


SELECT * 
FROM bucharest_temperature_daylevel;




SELECT year,month,day, temperature, lag(temperature) over (order by year,month,day) as prev_temperature
FROM bucharest_temperature_daylevel;




SELECT year,month,day, temperature, lag(temperature) over (partition by '' order by year,month,day) as prev_temperature
FROM bucharest_temperature_daylevel;






SELECT year,month,day, temperature, row_number() over (partition by year, month order by year,month,day desc) as last_day_of_month_from_data
FROM bucharest_temperature_daylevel
ORDER BY 1,2,3;




SELECT year,month, row_number() over (partition by year order by month) as rn,
rank() over (partition by year order by month) as rank,
dense_rank() over (partition by year order by month) as dense_rank
FROM bucharest_temperature_daylevel
ORDER BY 1,2;




SELECT year,month,day,temperature, 
avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN 3 preceding and current row) last_3_days_avg
FROM bucharest_temperature_daylevel
ORDER BY 1,2,3;


SELECT year,month,day,temperature, 
avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN 2 preceding and 1 following) last_3_days_avg
FROM bucharest_temperature_daylevel
ORDER BY 1,2,3;


SELECT year,month,day,temperature, 
avg(temperature) over (partition by '' order by year,month,day ROWS BETWEEN unbounded preceding and 1 preceding) preceding_days_avg
FROM bucharest_temperature_daylevel
ORDER BY 1,2,3;






SELECT year,month,avg(temperature) as temp
FROM bucharest_temperature_daylevel
GROUP BY 1,2;




SELECT year,month,day,temperature,
avg(temperature) over (partition by year, month order by year,month ROWS BETWEEN unbounded preceding and current row) rows_avg,
avg(temperature) over (partition by year, month order by year,month RANGE BETWEEN unbounded preceding and current row) range_avg
FROM bucharest_temperature_daylevel
ORDER BY 1,2,3;

Amira Shaker

Data Analyst

1 年

In ROWS can we use rows between specified value and another specified value if not how can i achieve it ?

Alex Fernando dos Santos Silva

Big Data Engineer | Analytics

4 年

Window functions are awesome! I just learned it these days. You could cover regex as well, studying it right now.

2 次回应

查看更多评论

SQL window functions: Rows, range, unbounded preceding

Octavian Zarzu

#dataengineering ??

Table of Contents

1. Input data

2. Load and clean data in Snowflake

3. Window functions

Q: How was the temperature on the previous day?

LAG AND LEAD functions

Q: How can you remove the last day of each month?

ROW_NUMBER function

RANK and DENSE_RANK functions

Q: What was the coldest 4-day period?

ROWS PRECEDING and ROWS FOLLOWING

Q: What is the cumulative average temperature?

UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING

Q: What is the monthly average, while keeping the same granularity of the dataset?

ROWS and RANGES

Conclusion

Full SQL code

更多精彩文章

社区洞察

其他会员也浏览了