Practice Window Functions for Data Analysis with SQLite and Jupyter Notebook
Chonghua Yin
Head of Data Science | Climate Risk & Extreme Event Modeling | AI & Geospatial Analytics
In the previous tutorials, we discussed aggregate functions that can take a large group of rows and output a single value for them. Often, being able to summarize a group of rows to a single value is important and useful. However, there are times when you want to keep the individual rows as well as gaining a summarizing value. To do this, a new set of functions named window functions come into our field of vision.
PostgreSQL's documentation does an excellent job of introducing the concept of Window Functions:
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
1. Window Functions within SQLite
Window Functions get available since SQLite 3.25, which includes these built-in Window functions:
- row_number()
- rank()
- dense_rank()
- percent_rank()
- cume_dist()
- ntile(N)
- lag(expr), lag(expr, offset), lag(expr, offset, default)
- lead(expr), lead(expr, offset), lead(expr, offset, default)
- first_value(expr)
- last_value(expr)
- nth_value(expr, N)
Previous tutorials are created based on a little bit older SQLite versions. At that time, window Functions are not available. In this tutorial, I am going to touch some of them in the following 4 sections:
- First, go through a few WINDOW functions based on regular aggregate functions, such as AVG, MIN/MAX, COUNT, SUM.
- Second, focus on rank-related functions, such as ROW_NUMBER, RANK,and RANK_DENSE. Those functions are handy when generating ranking indexes, and you need to be fluent in those functions before entering a data scientist SQL interview.
- Third, talk about generating statistics (e.g., percentiles, quartiles, median, etc .) with the NTILE function, a common task for a data scientist.
- Forth, focus on LAG and LEAD, two functions that are super important if you are interviewing for a role that requires dealing with time-series data.
We presume you already set up the test environment following the previous tutoiral. However, we use ipython-sql Magic functions to query a sqllite database as it is more straightforward and concise. ipython-sql introduces a %sql (or %%sql) magic to notebooks allowing you to connect to a database, using SQLAlchemy connect strings, then issue SQL commands within IPython or Jupyter/IPython Notebook. However, if you do not use the %%sql magic in your Jupyter notebook, the output of your SQL queries will be just a plain list of tuples. A better way to work with the result sets returned is to draw them as a table with the headers. This is where the IPython SQL magic gets very handy.
No worries, all tutorials and demonstration data are available at the link on Github.
%load_ext sql %sql sqlite:///data/demo.db3
2. Practice Window Functions
The following is the basic syntax of a window function:
SELECT {columns}, {window_func} OVER (PARTITION BY {partition_key} ORDER BY {order_key}) FROM table1;
Where {columns} are the columns to retrieve from tables for the query, {window_func} is the window function you want to use, {partition_key} is the column or columns you want to partition on (more on this later), {order_key} is the column or columns you want to order by, and table1 is the table or joined tables you want to pull data from. The OVER keyword indicates where the window definition starts.
Let's take the table of watershed_monthly in the sqlite database to practice these window functions.
2.1 The usual suspects: AVG, MIN/MAX, SUM, and COUNT
When using window functions, you can apply the same aggregates that you would under normal circumstances—AVG, MIN/MAX, SUM, and COUNT. The easiest way to understand these is to re-run the previous example with some additional functions.
2.1.1 AVG
SELECT YR, MO, PREC_mm, ROUND(AVG(PREC_mm) OVER(PARTITION BY MO)) AS avg_PREC_mm FROM watershed_monthly ORDER by MO LIMIT 12 Done.
Out:
There is no GROUP BY clause for the AVG function, but how does the SQL engine know which rows to use to compute the average? The answer is the PARTITION BY clause inside the OVER() utility, and we are calculating the average based on a unique value of rating.
In the final output, every row has the average PREC_mm from the same month (MO). By adding a ORDER by, you will find there is the same average value for each month.
2.1.2 SUM
Let’s take a look at a more complicated example, where we calculated a running sum with a window function.
%%sql sqlite:// SELECT YR, MO, PREC_mm, SUM(PREC_mm) OVER(ORDER BY MO) AS running_total, SUM(PREC_mm) OVER() AS overall, ROUND(SUM(PREC_mm) OVER(ORDER BY MO) * 100.0 /SUM(PREC_mm) OVER(), 2) AS running_percentage FROM watershed_monthly LIMIT 12
Notice that there is no PARTITION BY clause because I am not grouping those into any PREC_mm sub-categories. I want to compute my overall progress but not based on any subgroups or categories.
Another thing to notice is that if I don’t add anything inside the OVER() function, I get the total number of PREC_mm from the entire table. As you can see from the second last column: they all have the same value of 24084.34, but after I add the ORDER BY clause, I get the running total of the PREC_mm up to that specific row (running_total column).
2.1.3 Others
You can try MIN, COUNT, AVG by yourself, which are all similar to previous examples of SUM and MAX.
2.2 ROW_NUMBER, RANK, DENSE_RANK
Let’s go through some of the essential WINDOW functions: ROW_NUMBER and RANK. In the following example, our goal is to create a ranking index based on the amount of PREC_mm for the entire table.
2.2.1 ROW_NUMBER
%%sql sqlite:// SELECT YR, MO, PREC_mm, ROW_NUMBER() OVER(ORDER BY PREC_mm DESC) AS row_num FROM watershed_monthly ORDER by row_num LIMIT 12
As you can see, the ROW_NUMBER function generates a sequence of integers, starting from 1, for each row.
2.2.2 ROW_NUMBER over PARTITION
Let’s take a look at another example. Instead of comparing a amount of PREC_mm to all other PREC_mm from the entire table, we can rank them within each month using PARTITION BY.
%%sql sqlite:// SELECT YR, MO, PREC_mm, ROW_NUMBER() OVER(PARTITION BY MO ORDER BY PREC_mm DESC) AS row_num FROM watershed_monthly ORDER by row_num
ROW_NUMBER with OVER and PARTITION BY is a regular pattern that is frequently used in advanced SQL.
2.2.3 Others
Now you can try the RANK and DENSE_RANK functions by yourself as they are very similar to ROW_NUMBER. The difference between RANK and ROW_NUMBER is that RANK assigns the same unique values if there is a tie and restarts the next value with the total number of rows up to that row.
DENSE_RANK. It is very similar to RANK but differs in how it handles ties. It restarts with the following immediate consecutive value rather than creating a gap.
2.3 NTILE
You can use window functions to identify what percentile (or quartile, or any other subdivision) a given row falls into. The syntax is NTILE(# of buckets). In this case, ORDER BY determines which column to use to determine the quartiles (or whatever number of 'tiles you specify).
NTILE takes an argument of the number of buckets and then creates this number of buckets as equally as possible, based on how the rows are partitioned and ordered inside the OVER function.
%%sql sqlite:// SELECT YR, MO, PREC_mm, NTILE(100) OVER(ORDER BY PREC_mm) AS percentile FROM watershed_monthly ORDER BY percentile
You can create a few more statistics, such as DECILES (10 buckets) and QUARTILES (4 buckets).
2.4 LAG and Lead
It can often be useful to compare rows to preceding or following rows, especially if you've got the data in an order that makes sense. You can use LAG or LEAD to create columns that pull values from other rows—all you need to do is enter which column to pull from and how many rows away you'd like to do the pull. LAG pulls from previous rows and LEAD pulls from following rows.
Their syntax is very similar to other window functions. Instead of focusing on the format of the syntax, let me show you a couple of examples.
2.4.1 LAG
- In the first step, we created a temporal table of ann_total with PREC_mm.
- And in the second step, we appended the previous year’s Ann_PREC_mm to the current year’s using the LAG function.
- Notice that last 2 columns of the first row are empty. It’s simply because 1981 is the first available year.
- We also specified the offset, which is 1, so we fetch the next row. If you change this number to 2, then you compare the current year’s Ann_PREC_mm to the year before the previous year.
- Finally, we divided the current year’s Ann_PREC_mm by the previous year’s to create our year Ann_PREC_mm growth.
%%sql sqlite:// WITH annul_total AS ( SELECT YR, SUM(PREC_mm) AS Ann_PREC_mm FROM watershed_monthly GROUP by YR ) SELECT YR, Ann_PREC_mm, LAG(Ann_PREC_mm, 1) OVER (ORDER BY YR) prev_year_total, ROUND(Ann_PREC_mm *1.0/LAG(Ann_PREC_mm,1) OVER (ORDER BY YR), 2) AS dod FROM annul_total ORDER BY YR
2.4.2 LEAD
Let’s take a look at another example. It’s very similar to the previous one, but instead of appending the previous year’s Ann_PREC_mm, we used the LEAD function with an offset of 1 to get the next year’s Ann_PREC_mm.
We then divided the next year’s Ann_PREC_mm by the current year’s Ann_PREC_mm to get the year-over-year growth.
Notice that last 2 columns of the last two rows are empty. It’s simply because 2010 and 2009 is the first two available years.
%%sql sqlite:// WITH annul_total AS ( SELECT YR, SUM(PREC_mm) AS Ann_PREC_mm FROM watershed_monthly GROUP by YR ) SELECT YR, Ann_PREC_mm, LEAD(Ann_PREC_mm, 2) OVER (ORDER BY YR) prev_year_total, ROUND(Ann_PREC_mm *1.0/LEAD(Ann_PREC_mm,2) OVER (ORDER BY YR), 2) AS dod FROM annul_total ORDER BY YR
Summary
Congratulations! You have followed through all the examples and have seen most of the common WINDOW functions/patterns.
Findow functions perform calculations similar to how aggregation functions do, but with the added benefit that they have access to data within the individual rows, which makes them quite powerful. Window functions, also called analytics functions, are powerful SQL functions that will take you from a beginner SQL user to an intermediate SQL user.
However, for many, since they are so foreign to standard SQL, they are difficult to learn and understand, have strange syntax—and are very often avoided. Writing a bug-free WINDOW function query could be quite challenging. It takes time and practice to become a master, and you are getting there soon.
More tutorials are available at https://github.com/royalosyin/Practice-SQL-with-SQLite-and-Jupyter-Notebook
References and resources
https://sqlite.org/windowfunctions.html
https://learnsql.com/blog/window-functions-vs-aggregate-functions/
https://blog.jupyter.org/a-jupyter-kernel-for-sqlite-9549c5dcf551
https://blog.xojo.com/2018/12/18/sqlite-3-25-adds-window-functions-and-improves-alter-table/
https://towardsdatascience.com/sql-window-analytic-functions-explained-in-4-minutes-6171c6983263