Practice Window Functions for Data Analysis with SQLite and Jupyter Notebook

Practice Window Functions for Data Analysis with SQLite and Jupyter Notebook

In the previous tutorials, we discussed aggregate functions that can take a large group of rows and output a single value for them. Often, being able to summarize a group of rows to a single value is important and useful. However, there are times when you want to keep the individual rows as well as gaining a summarizing value. To do this, a new set of functions named window functions come into our field of vision.

PostgreSQL's documentation does an excellent job of introducing the concept of Window Functions:

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

1. Window Functions within SQLite

Window Functions get available since SQLite 3.25, which includes these built-in Window functions:

  • row_number()
  • rank()
  • dense_rank()
  • percent_rank()
  • cume_dist()
  • ntile(N)
  • lag(expr), lag(expr, offset), lag(expr, offset, default)
  • lead(expr), lead(expr, offset), lead(expr, offset, default)
  • first_value(expr)
  • last_value(expr)
  • nth_value(expr, N)

Previous tutorials are created based on a little bit older SQLite versions. At that time, window Functions are not available. In this tutorial, I am going to touch some of them in the following 4 sections:

  • First, go through a few WINDOW functions based on regular aggregate functions, such as AVGMIN/MAXCOUNTSUM.
  • Second, focus on rank-related functions, such as ROW_NUMBERRANK,and RANK_DENSE. Those functions are handy when generating ranking indexes, and you need to be fluent in those functions before entering a data scientist SQL interview.
  • Third, talk about generating statistics (e.g., percentiles, quartiles, median, etc .) with the NTILE function, a common task for a data scientist.
  • Forth, focus on LAG and LEAD, two functions that are super important if you are interviewing for a role that requires dealing with time-series data.

We presume you already set up the test environment following the previous tutoiral. However, we use ipython-sql Magic functions to query a sqllite database as it is more straightforward and concise. ipython-sql introduces a %sql (or %%sql) magic to notebooks allowing you to connect to a database, using SQLAlchemy connect strings, then issue SQL commands within IPython or Jupyter/IPython Notebook. However, if you do not use the %%sql magic in your Jupyter notebook, the output of your SQL queries will be just a plain list of tuples. A better way to work with the result sets returned is to draw them as a table with the headers. This is where the IPython SQL magic gets very handy.

No worries, all tutorials and demonstration data are available at the link on Github.

%load_ext sql

%sql sqlite:///data/demo.db3

2. Practice Window Functions

The following is the basic syntax of a window function:

SELECT 
  {columns}, {window_func} OVER (PARTITION BY {partition_key} ORDER BY {order_key}) 
FROM table1;

Where {columns} are the columns to retrieve from tables for the query, {window_func} is the window function you want to use, {partition_key} is the column or columns you want to partition on (more on this later), {order_key} is the column or columns you want to order by, and table1 is the table or joined tables you want to pull data from. The OVER keyword indicates where the window definition starts.

Let's take the table of watershed_monthly in the sqlite database to practice these window functions.

2.1 The usual suspects: AVG, MIN/MAX, SUM, and COUNT

When using window functions, you can apply the same aggregates that you would under normal circumstances—AVG, MIN/MAX, SUM, and COUNT. The easiest way to understand these is to re-run the previous example with some additional functions.

2.1.1 AVG

SELECT
    YR,
    MO,
    PREC_mm,
    ROUND(AVG(PREC_mm) OVER(PARTITION BY MO)) AS avg_PREC_mm
FROM watershed_monthly
ORDER by MO
LIMIT 12
Done.

Out:

No alt text provided for this image

There is no GROUP BY clause for the AVG function, but how does the SQL engine know which rows to use to compute the average? The answer is the PARTITION BY clause inside the OVER() utility, and we are calculating the average based on a unique value of rating.

In the final output, every row has the average PREC_mm from the same month (MO). By adding a ORDER by, you will find there is the same average value for each month.

2.1.2 SUM

Let’s take a look at a more complicated example, where we calculated a running sum with a window function.

%%sql sqlite://
SELECT
    YR,
    MO,
    PREC_mm,    
    SUM(PREC_mm) OVER(ORDER BY MO) AS running_total,
    SUM(PREC_mm) OVER() AS overall,
    ROUND(SUM(PREC_mm) OVER(ORDER BY MO) * 100.0 /SUM(PREC_mm) OVER(), 2) AS running_percentage
FROM watershed_monthly
LIMIT 12
No alt text provided for this image

Notice that there is no PARTITION BY clause because I am not grouping those into any PREC_mm sub-categories. I want to compute my overall progress but not based on any subgroups or categories.

Another thing to notice is that if I don’t add anything inside the OVER() function, I get the total number of PREC_mm from the entire table. As you can see from the second last column: they all have the same value of 24084.34, but after I add the ORDER BY clause, I get the running total of the PREC_mm up to that specific row (running_total column).

2.1.3 Others

You can try MINCOUNTAVG by yourself, which are all similar to previous examples of SUM and MAX.

2.2 ROW_NUMBER, RANK, DENSE_RANK

Let’s go through some of the essential WINDOW functions: ROW_NUMBER and RANK. In the following example, our goal is to create a ranking index based on the amount of PREC_mm for the entire table.

2.2.1 ROW_NUMBER

%%sql sqlite://
SELECT
    YR,
    MO,
    PREC_mm,  
    ROW_NUMBER() OVER(ORDER BY PREC_mm DESC) AS row_num
FROM watershed_monthly
ORDER by row_num
LIMIT 12
No alt text provided for this image

As you can see, the ROW_NUMBER function generates a sequence of integers, starting from 1, for each row.

2.2.2 ROW_NUMBER over PARTITION

Let’s take a look at another example. Instead of comparing a amount of PREC_mm to all other PREC_mm from the entire table, we can rank them within each month using PARTITION BY.

%%sql sqlite://
SELECT
    YR,
    MO,
    PREC_mm,  
    ROW_NUMBER() OVER(PARTITION BY MO ORDER BY PREC_mm DESC) AS row_num
FROM watershed_monthly
ORDER by row_num
No alt text provided for this image

ROW_NUMBER with OVER and PARTITION BY is a regular pattern that is frequently used in advanced SQL.

2.2.3 Others

Now you can try the RANK and DENSE_RANK functions by yourself as they are very similar to ROW_NUMBER. The difference between RANK and ROW_NUMBER is that RANK assigns the same unique values if there is a tie and restarts the next value with the total number of rows up to that row.

DENSE_RANK. It is very similar to RANK but differs in how it handles ties. It restarts with the following immediate consecutive value rather than creating a gap.

2.3 NTILE

You can use window functions to identify what percentile (or quartile, or any other subdivision) a given row falls into. The syntax is NTILE(# of buckets). In this case, ORDER BY determines which column to use to determine the quartiles (or whatever number of 'tiles you specify).

NTILE takes an argument of the number of buckets and then creates this number of buckets as equally as possible, based on how the rows are partitioned and ordered inside the OVER function.

%%sql sqlite://
SELECT
    YR,
    MO,
    PREC_mm,  
    NTILE(100) OVER(ORDER BY PREC_mm) AS percentile
FROM watershed_monthly
ORDER BY percentile
No alt text provided for this image

You can create a few more statistics, such as DECILES (10 buckets) and QUARTILES (4 buckets).


2.4 LAG and Lead

It can often be useful to compare rows to preceding or following rows, especially if you've got the data in an order that makes sense. You can use LAG or LEAD to create columns that pull values from other rows—all you need to do is enter which column to pull from and how many rows away you'd like to do the pull. LAG pulls from previous rows and LEAD pulls from following rows.

Their syntax is very similar to other window functions. Instead of focusing on the format of the syntax, let me show you a couple of examples.

2.4.1 LAG

  1. In the first step, we created a temporal table of ann_total with PREC_mm.
  2. And in the second step, we appended the previous year’s Ann_PREC_mm to the current year’s using the LAG function.
  3. Notice that last 2 columns of the first row are empty. It’s simply because 1981 is the first available year.
  4. We also specified the offset, which is 1, so we fetch the next row. If you change this number to 2, then you compare the current year’s Ann_PREC_mm to the year before the previous year.
  5. Finally, we divided the current year’s Ann_PREC_mm by the previous year’s to create our year Ann_PREC_mm growth.
%%sql sqlite://
WITH annul_total AS (
    SELECT
        YR,    
        SUM(PREC_mm) AS Ann_PREC_mm    
    FROM watershed_monthly
    GROUP by YR
)

SELECT
    YR,
    Ann_PREC_mm,
    LAG(Ann_PREC_mm, 1) OVER (ORDER BY YR) prev_year_total,
    ROUND(Ann_PREC_mm *1.0/LAG(Ann_PREC_mm,1) OVER (ORDER BY YR), 2) AS dod
FROM annul_total
ORDER BY YR
No alt text provided for this image

2.4.2 LEAD

Let’s take a look at another example. It’s very similar to the previous one, but instead of appending the previous year’s Ann_PREC_mm, we used the LEAD function with an offset of 1 to get the next year’s Ann_PREC_mm.

We then divided the next year’s Ann_PREC_mm by the current year’s Ann_PREC_mm to get the year-over-year growth.

Notice that last 2 columns of the last two rows are empty. It’s simply because 2010 and 2009 is the first two available years.

%%sql sqlite://
WITH annul_total AS (
    SELECT
        YR,    
        SUM(PREC_mm) AS Ann_PREC_mm    
    FROM watershed_monthly
    GROUP by YR
)

SELECT
    YR,
    Ann_PREC_mm,
    LEAD(Ann_PREC_mm, 2) OVER (ORDER BY YR) prev_year_total,
    ROUND(Ann_PREC_mm *1.0/LEAD(Ann_PREC_mm,2) OVER (ORDER BY YR), 2) AS dod
FROM annul_total
ORDER BY YR
No alt text provided for this image

Summary

Congratulations! You have followed through all the examples and have seen most of the common WINDOW functions/patterns.

Findow functions perform calculations similar to how aggregation functions do, but with the added benefit that they have access to data within the individual rows, which makes them quite powerful. Window functions, also called analytics functions, are powerful SQL functions that will take you from a beginner SQL user to an intermediate SQL user.

However, for many, since they are so foreign to standard SQL, they are difficult to learn and understand, have strange syntax—and are very often avoided. Writing a bug-free WINDOW function query could be quite challenging. It takes time and practice to become a master, and you are getting there soon.

More tutorials are available at https://github.com/royalosyin/Practice-SQL-with-SQLite-and-Jupyter-Notebook

References and resources

https://sqlite.org/windowfunctions.html

https://learnsql.com/blog/window-functions-vs-aggregate-functions/

https://blog.jupyter.org/a-jupyter-kernel-for-sqlite-9549c5dcf551

https://blog.xojo.com/2018/12/18/sqlite-3-25-adds-window-functions-and-improves-alter-table/

https://sqlpad.io/playground/

https://towardsdatascience.com/sql-window-analytic-functions-explained-in-4-minutes-6171c6983263

要查看或添加评论,请登录

Chonghua Yin的更多文章

社区洞察

其他会员也浏览了