"Key Concepts, to Master Window Functions"?

"Key Concepts, to Master Window Functions"

Introduction

If work with data, window functions can significantly level up your SQL skills. If you have ever thought

window functions are confusing

How do I perform calculations that use information from other rows, in SQL

then the post is for you. We go over 6 key concepts to master the window function.

6 Key Concepts

A Partition refers to a set of rows that have the same values in one or more columns. These column(s) are specified using the?PARTITION BY?clause as we will see in the examples.

1. When to Use

Window functions are useful when you must

1.???Rank rows based on a certain column(s) within each partition in the table.

2.???Label numerical values within each partition into buckets based on percentile.

3.???Identify the first (or second or last) event within a specific partition.

4.???Calculate the rolling average/mean.

General uses of window functions are when

1.???A calculation is needed to be performed on a set of rows (defined by partition columns) and keep the result at row level. If we use?group by,?we would have to use aggregation functions on any columns that are not part of the?group by?clause.

2.???Need to perform calculations based on a rolling window.

The clickstream table contains?login, click, buy, sell, return, and logout?events from our app. The query below?orders the events per userId & sessionId, based on the event creation (datetimeCreated) time.

No alt text provided for this image

2. Partition By

Partition By?clause defines the column(s) that groups related rows. This set of rows is referred to as a partition. In our case, we partition by?userId, sessionId?which groups rows, as shown below.

No alt text provided for this image

3. Order By

Within a partition, we can optionally define the order of rows using the?Order By?clause. Here we order the rows within a partition in reverse chronological order based on event created time (datetimeCreated).

No alt text provided for this image

4. Function

This is defined before the?OVER (PARTITION BY ..). This function will be applied to the rows within a partition. The available window functions depend on your database, e.g., in PostgreSQL we have?these window functions .

In our example, we use?ROW_NUMBER?to get the order in which the events were created per user session.

No alt text provided for this image

5. Lead and Lag

These can be used to perform calculations based on data from other rows. Lead and Lag are used to access data from rows after or before the current row respectively. The rows can be ordered using the?order by?clause.

Lead and lag can be used to calculate the time difference between events for a given user session (partition). In the example below, we use lead and lag to get the time that the next and previous events occur during a user session

No alt text provided for this image
No alt text provided for this image

6. Rolling Window

We can use window functions without a?PARTITION BY?clause to simulate a rolling window over all the rows Let’s say we want to find the number of?buy?events within the last 5 events across all users, exclusive of the current event, then we do the following.

No alt text provided for this image
No alt text provided for this image

You can see from the image that the window starts from the?5 PRECEDING?rows and stops before the current row, which is the?1 PRECEDING?row. This?num_purchases?will be calculated for each row as seen in the result set above.

Let’s write a query to check if one of the current, previous, or next events was a?buy?event

No alt text provided for this image
No alt text provided for this image

You can also use this custom window ranges within a specific partition.

Efficiency Considerations

Window functions can be expensive, use?EXPLAIN ?to see the query plan. This will help when using window functions in low latency situations.

For example, if you want to only get the row with the latest event. It might be beneficial to use another technique, such as a group by shown below

No alt text provided for this image
No alt text provided for this image

The cost shown in the red box in the above images represents the cost of the query(“cost units are arbitrary, but conventionally mean disk page fetches” - pg. docs). The lower the cost the faster your query execution.

Conclusion

The next time you come across a particularly complex query, which involves having to group but keep the same granularity, or calculate metrics based on values from other rows use window functions. Hope this article gives you a good idea of when to use window functions, what they are, what they can do and the performance considerations to be aware of while using them.

要查看或添加评论,请登录

社区洞察

其他会员也浏览了