登录查看更多内容

Window Functions in SQL

Vedant M.

Analytics @ Target ??

发布日期: 2022年5月26日

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, the use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result. At the core of any window function, there is a set of records, called the window frame, defined using an OVER clause. Knowing what records are in the window frame, how they are ordered, and what their upper and lower bounds are, are critical in understanding how window functions work.

Let's suppose we need to Find the Department wise highest salary. The basic way is to use Group By to fetch the required data.

select dept_name, max(salary) as "Max Salary in each Dept"
from employee group by dept_name.

As you can see in the below output, the number of rows has been reduced to a number of departments in the table.

What if we want all the information along with the highest salary in each department?? Here We can use Window Functions.

select *, 
max(salary) over(partition by dept_name order by salary desc)
as "Max Salary in each Dept" from employee

A simple way to create a window frame is by using an OVER clause with a PARTITION BY sub-clause. we can use an ORDER BY sub-clause to order the rows inside a window frame. Having the window frame ordered by some criteria allows us to use analytic window functions like LEAD(), LAG(), and FIRST_VALUE(), among others.

Types of Window Functions:

ROW_NUMBER( ): As the name suggests, it assigns a unique value to each of the records in the table.
RANK( ): Rank is similar to row_number( ) but in the case of duplicates, all the duplicate records get the same rank and for every duplicate rank it skips a value.
DENSE_RANK( ): Dense rank is similar to rank( ) but for every duplicate rank, it does not skip a value.

difference between rank, denserank and rownumber

select *,
row_number() over(partition by dept_name order by salary desc) as rn,
rank() over(partition by dept_name order by salary desc) as rnk,
dense_rank() over(partition by dept_name order by salary desc) as "dense_rnk"
from employee,

4. Lead( ) / Lag( ): Lead and lag are used to compare a row with another row based on a particular column. Lead is used to compare the row following the current row, and lag is used to compare the row preceding the current row.

Lead and Lag both take 3 arguments first being the column, for comparison second being the number of rows following/preceding the current row. and third, is the default value instead of [null].

select *, 
lead(salary,1,0) over(partition by dept_name order by emp_id) as "Lead",
lag(salary,1,0) over(partition by dept_name order by emp_id) as "Lag"
from employee

Window Frame Bounds With ROWS or RANGE

A very interesting feature of the OVER clause is the ability to specify the upper and lower bounds of a window frame. These bounds can be specified by using one of the two subclauses in the OVER clause: ROWS or RANGE

FRAME: It is a subset of Partition. Whenever you are using a window function most important thing is, it creates a window/ partition and it applies the window function to each of these partitions. Frame Clause is useful while using aggregate functions like AVG(), MIN(), etc.

领英推荐

Subqueries and joining tables - your indispensable SQL…

'ESCENT 1 年前

Date Functions in SQL Part I

iBridge360 1 年前

LeetCode Curated SQL Solutions and Discussion - Week 2

Ankur Ranjan 2 年前

The default frame used by many of the query languages is range between unbounded preceding and current row. Now, what is unbounded preceding? It is the first row of the partition whereas the current row is the row currently pointing to.

5. FIRST_VALUE( ): To extract a value/column value from the very first record within the Partition.

6. LAST_VALUE( ): To Extract a value/column value from the last record within the partition. To use Last_Value make sure you are using Frame Clause as range between unbounded preceding and unbounded following.

7. NTH_VALUE( ): To Extract a value/column value from the Nth record within the partition.

NOTE: First_Value, Last_Value, and Nth_Value, all take an argument which is the value to be displayed in the column. Nth_Value takes one more additional argument which is the value of N and N must be a positive integer.

select *, 
first_value(emp_name) over(partition by dept_name order by salary desc)
as "Employee with highest salary" ,last_value(emp_name) over(partition by dept_name order by salary desc range between unbounded preceding and unbounded following) as
"Employee with lowest salary"
from employee,

8. NTILE( ): It is used to group data together within the partition and then place them into certain buckets. SQL makes sure that each bucket has an equal number of records. NTILE takes a positive integer as an argument which tells the number of buckets needed to divide the table into.

Let's say we want to divide the table into 3 parts Top Performers, High Performers, and Average Performers based on the Salary in each Department.

select emp_name, dept_name, a.bucket case when a.bucket=1 then 'Top performer'when a.bucket=2 then 'High performer'when a.bucket = 3 then 'Average Performer'
?else 'Invalid bucket' end as "Ranking" from ( select *, ntile(3) over (partition by dept_name order by salary desc) as bucket
from employee ) a

9. CUME_DIST( ): It is a cumulative distribution function, to identify the distribution percentage of each record with respect to all the records within the result set. It always returns values greater than 0, and the highest value is 1.

For Example: To find Employees constituting the first 30% of all Employees based on Salary.

10. PERCENT_RANK( ): It is used to find the relative rank of each row in the form of a percentage. It always returns values greater than 0, and the highest value is 1. It does not count any NULL values and returns 0 for NULL values.

For Example: To find the Top 10% of employees based on salary.

select *, case when a.rank_percentage >10 then 'Yes else 'No' end as 'In Top 10',
case when a.cummulative_dist >30 then 'Yes' else 'No' end as 'Constituting >30 %' from ( select *,
round(cume_dist() over(partition by dept_name order by salary) *100,2)
as cummulative_dist,
round(percent_rank() over(partition by dept_name order by salary) * 100,2)
as rank_percentage from employee ) a