SQL Window Function
Pintu Kumar Kushwaha
?? Data Science & Business Analyst Intern @The Spark Foundation ?? Data Analyst | Excel | SQL | Tableau | Power BI | Python ?? | Snowflake?? |
Window functions in SQL are a type of analytical function that perform calculations across a set of rows that are related to the current row, called a "window".
A window function calculates a value for each row in the result set based on a subset of the rows that are defined by a window specification.
The window specification is defined using the OVER() clause in SQL, which specifies the partitioning and ordering of the rows that the window function will operate on. The partitioning divides the rows into groups based on a specific column or expression, while the ordering defines the order in which the rows are processed within each group.
Window functions applies aggregate and ranking functions over a particular window (set of rows). OVER clause is used with window functions to define that window. OVER clause does two things :?
Note:?If partitions aren’t done, then ORDER BY orders all rows of table.?
Syntax:??
SELECT coulmn_name1,
window_function(cloumn_name2)
OVER([PARTITION BY column_name1] [ORDER BY column_name3]) AS new_column
FROM table_name;
window_function= any aggregate or ranking function
column_name1= column to be selected
coulmn_name2= column on which window function is to be applied
column_name3= column on whose basis partition of rows is to be done
new_column= Name of new column
table_name= Name of table
Window functions can only be used in the?SELECT?or?ORDER BY?clauses of a SQL query. In a window function, all the rows returned by the?SELECT?are divided into windows (or frames). The criteria for deciding which rows belong to which window is defined by putting a?PARTITION BY?subclause inside the window function invocation. The window function then?computes?one result based on all the rows?in one window.
List of Window Functions
Let's create a table marks:
CREATE DATABASE testdb;
use testdb;
CREATE TABLE marks (
?student_id INTEGER PRIMARY KEY AUTO_INCREMENT,
? ? name VARCHAR(255),
? ? branch VARCHAR(255),
? ? obtained_mark INTEGER
);
INSERT INTO marks (name,branch,obtained_mark)VALUES?
('Nitish','EEE',82),
('Rishabh','EEE',91),
('Anukant','EEE',69),
('Rupesh','EEE',55),
('Shubham','CSE',78),
('Ved','CSE',43),
('Deepak','CSE',98),
('Arpan','CSE',95),
('Vinay','ECE',95),
('Ankit','ECE',88),
('Anand','ECE',81),
('Rohit','ECE',95),
('Prashant','MECH',75),
('Amit','MECH',69),
('Sunny','MECH',39),
('Gautam','MECH',51);
SELECT * FROM marks;
Now find the average marks by branch, we can get it by applying group by
SELECT
? ? branch, AVG(obtained_mark) AS branch_avg
FROM
? ? marks
GROUP BY branch;?
Output:
The same result we can be obtained by using the window function :
select *,
? ? ? ?avg(obtained_mark) over(partition by branch) 'branch_avg'?
from marks
order by student_id;?
Output:
GROUP BY vs Window Functions
When comparing window functions and?GROUP BY, it’s essential to remember that?GROUP BY?collapses the individual records into groups; after using GROUP BY, you cannot refer to any individual field because it is collapsed.?
Window functions are a powerful feature of SQL. They allow us to apply functions like?AVG,?COUNT,?MAX, and?MIN?on a group of records while still leaving the individual records accessible. Since the individual records are not collapsed, we can create queries showing data from the individual record together with the result of the window function. This is what makes window functions so powerful.
Suppose we want to obtain a list of student_id, student names, branch, and calculate the average of the "obtained_mark" column across all rows in the "marks" table.
select *,
? ? ? ?avg(obtained_mark) over() 'overall_avg'?
from marks;
Explanation:
This SQL query uses a window function to calculate the average of the "obtained_mark" column across all rows in the "marks" table. Let's break down the query step by step:
So, the combined query retrieves all columns from the "marks" table along with an additional column that contains the average of the "obtained_mark" column across all rows. This average value will be repeated for every row because there's no specific partitioning or ordering specified for the window function.
Let's say want to retrieve all columns from the "marks" table and include an additional column showing the average obtained marks for each branch.
select *,
avg(obtained_mark) over(partition by branch) 'branch_avg'?
from marks;?
Explanation:
Output:
Let's say want to retrieve all columns from the "marks" table and include two additional columns: one showing the minimum obtained marks across all students and another showing the maximum obtained marks across all students.
select *,
min(obtained_mark) over() min_marks,
max(obtained_mark) over() max_marks
from marks;
Explanation:
Output:
Let's say we want to retrieve all columns from the "marks" table and include two additional columns for each row: one showing the minimum obtained marks within the same branch and another showing the maximum obtained marks within the same branch.
SELECT *,
? ? ? ?MIN(obtained_mark) OVER(PARTITION BY branch) AS branch_min_marks,
? ? ? ?MAX(obtained_mark) OVER(PARTITION BY branch) AS branch_max_marks
FROM marks;
Explanation:
Output:
You want to retrieve all columns from the "marks" table and include several additional columns for each row: overall average obtained marks, average obtained marks within the same branch, minimum obtained marks across all students, maximum obtained marks across all students, minimum obtained marks within the same branch, and maximum obtained marks within the same branch. The result set should be ordered by student ID.
SQL Query:
SELECT *,
AVG(obtained_mark) OVER() AS overall_avg,
AVG(obtained_mark) OVER(PARTITION BY branch) AS branch_avg,
MIN(obtained_mark) OVER() AS overall_min_marks,
MAX(obtained_mark) OVER() AS overall_max_marks,
MIN(obtained_mark) OVER(PARTITION BY branch) AS branch_min_marks,
MAX(obtained_mark) OVER(PARTITION BY branch) AS branch_max_marks
FROM marks
ORDER BY student_id;
Explanation:
The resulting dataset will include the original columns from the "marks" table along with the calculated columns for overall and branch-specific averages, minimum and maximum marks for both the entire dataset and within each branch. The result set will be ordered by student ID.
Output:
Aggregate Window Function :?
Various aggregate functions such as SUM(), COUNT(), AVERAGE(), MAX(), MIN() applied over a particular window (set of rows) are called aggregate window functions.?
Aggregate Function with OVER()
Find all the students who have marks higher than the avg marks of
their respective branch
SELECT *
FROM (
? SELECT *,
? ? ? ? ?AVG(obtained_mark) OVER(PARTITION BY branch) AS branch_avg
? FROM marks
) AS t
WHERE obtained_mark > t.branch_avg
ORDER BY student_id;
Explanation:
Overall, the query effectively finds and lists all students whose marks are higher than the average marks of their respective branches. It combines the power of window functions and subqueries to achieve this filtering based on branch averages.
Output:
Find all the students who have marks lower than the avg marks of
their respective branch
-- Find all the students who have marks lower than the avg marks of? their respective branch
select * from
(select *,
avg(obtained_mark) over(partition by branch) as branch_avg
from marks) as t
where obtained_mark < t.branch_avg
order by student_id;
?
Example –?
Find average salary of employees for each department and order employees within a department by age.?
SELECT Name, Age, Department, Salary,
AVERAGE(Salary) OVER( PARTITION BY Department) AS Avg_Salary
FROM employee;
OR
SELECT *,
AVERAGE(Salary) OVER( PARTITION BY Department) AS Avg_Salary
FROM employee;
Let’s consider another case:?
SELECT Name, Age, Department, Salary,
AVERAGE(Salary) OVER( PARTITION BY Department ORDER BY Age) AS Avg_Salary
FROM employee
Here we also order the records within the partition as per age values and hence the average values change as per the sorted order.
The output of above query will be :??
Ranking Window Functions :?
Ranking functions are, RANK(), DENSE_RANK(), ROW_NUMBER()?
Note –?
ORDER BY() should be specified compulsorily while using rank window functions.?
Example –?
Calculate row no., rank, dense rank of employees is employee table according to salary within each department.?
SELECT
ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC)
AS emp_row_no, Name, Department, Salary,
RANK() OVER(PARTITION BY Department
ORDER BY Salary DESC) AS emp_rank,
DENSE_RANK() OVER(PARTITION BY Department
ORDER BY Salary DESC)
AS emp_dense_rank,
FROM employee
The output of above query will be :??
Write a SQL query to calculate the rank of each student based on their obtained marks in descending order.
SELECT *,
? ? ? ?RANK() OVER(ORDER BY obtained_mark DESC) AS rank
FROM marks;
Explanation:
领英推荐
Output:
Write a SQL query calculate the rank of each student based on their obtained marks within their respective branches, with the ranking determined in descending order of obtained marks.
SELECT *,
? ? ? ?RANK() OVER(PARTITION BY branch ORDER BY obtained_mark DESC) AS 'rank'
FROM marks;
Explanation:
SELECT *,
? ? ? ?RANK() OVER(PARTITION BY branch ORDER BY obtained_mark DESC) AS 'rank',
? ? ? ?DENSE_RANK() OVER(PARTITION BY branch ORDER BY obtained_mark DESC) AS 'dense_rank'
FROM marks;
Explanation:
Output:
In this example, both rank and dense_rank are calculated based on the students' obtained marks within their respective branches. The ranking in the rank column leaves gaps in ranking for tied marks, whereas the dense_rank column assigns consecutive ranks even for tied marks.
Both RANK() and DENSE_RANK() are window functions in SQL used to assign rankings to rows within result sets. However, they differ in how they handle tied or duplicate values. Here's a brief explanation of the key differences:
In summary:
The choice between these functions depends on your specific ranking requirements and how you want to handle tied values in your result set.
Example:
SELECT *,
? ? ? ?RANK() OVER(ORDER BY obtained_mark DESC) AS rank,
? ? ? ?DENSE_RANK() OVER(ORDER BY obtained_mark DESC) AS dense_rank
FROM marks;
Window Frame
A?window frame?is a set of rows that are somehow related to the current row. The window frame is evaluated separately within each partition.
A frame in a window function is a subset of rows within the partition that
determines the scope of the window function calculation. The frame is defined
using a combination of two clauses in the window function: ROWS and BETWEEN.
The ROWS clause specifies how many rows should be included in the frame
relative to the current row. For example, ROWS 3 PRECEDING means that the
frame includes the current row and the three rows that precede it in the partition.
The BETWEEN clause specifies the boundaries of the frame.
The bounds can be any of the five options:
The?lower_bound?must be BEFORE the?upper_bound.
Default Window Frame
Examples
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW : means that the frame includes all rows from the beginning of the partition up to and including the current row.
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: The frame includes the
current row and the row immediately before and after it.
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: The frame includes all rows in the partition.
ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING: The frame includes the
current row and the three rows before it and the two rows after it.
Analytic Window Functions
Let's try to find the first_value in our marks table i.e. find the top-performing student's name based on the highest obtained marks in descending order.
SELECT *,
? ? ? ?FIRST_VALUE(name) OVER(ORDER BY obtained_mark DESC) AS 'Topper'
FROM marks;
Explanation:
Let's try to find the first_value in our marks table i.e. find the top obtained marks for each student based on the highest marks in descending order.
SELECT *,
? ? ? ?FIRST_VALUE(obtained_mark) OVER(ORDER BY obtained_mark DESC) AS 'Topper'
FROM marks
ORDER BY student_id;
Find the lowest performing student's name based on the lowest obtained marks in descending order, considering all rows using the last_value window function.
SELECT *,
? ? ? ?LAST_VALUE(name) OVER(ORDER BY obtained_mark DESC?
? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS 'low_performer'
FROM marks
ORDER BY student_id;
Explanation:
Note:?You usually want to use?RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING?with?last_value(). With the default window frame for?ORDER BY,?RANGE UNBOUNDED PRECEDING,?last_value()?returns the value for the current row.
Find the lowest obtained marks for each student based on the lowest marks in descending order, considering all rows.
SELECT *,
? ? ? ?LAST_VALUE(obtained_mark) OVER(ORDER BY obtained_mark DESC?
? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS 'low_performer'
FROM marks
ORDER BY student_id;
Let's try to find the second topper in our marks table.
Uses the nth_value() window function to find the name of the second-top performing student based on the highest obtained marks in descending order, considering all rows.
SELECT *,
? ? ? ?NTH_VALUE(name, 2) OVER(ORDER BY obtained_mark DESC?
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS 'second_topper'
FROM marks;
Explanation:
Uses the nth_value() window function to find the name of the second-top performing student within each branch based on the highest obtained marks in descending order, considering all rows within each branch.
SELECT *,
? ? ? ?NTH_VALUE(name, 2) OVER(PARTITION BY branch ORDER BY obtained_mark DESC?
? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS 'branch_second_topper'
FROM marks;
Explanation:
Uses the nth_value() window function to find the name of the second lowest-performing student within each branch based on the lowest obtained marks in ascending order, considering all rows within each branch.
SELECT *,
? ? ? ?NTH_VALUE(name, 2) OVER(PARTITION BY branch ORDER BY obtained_mark?
? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS 'branch_second_low_performer'
FROM marks;
Find the branch toppers (Name, marks, branch)
Write a SQL query is designed to find the branch toppers for each branch in the "marks" table. It uses the first_value() window function to identify the top-performing student and their obtained marks within each branch.
SELECT name, branch, obtained_mark
FROM (
? ? SELECT *,
? ? ? ? ? ?FIRST_VALUE(name) OVER(PARTITION BY branch ORDER BY obtained_mark DESC) AS 'branch_topper',
? ? ? ? ? ?FIRST_VALUE(obtained_mark) OVER(PARTITION BY branch ORDER BY obtained_mark DESC) AS 'branch_topper_marks'
? ? FROM marks
) AS t
WHERE t.branch_topper_marks = obtained_mark AND t.branch_topper = name;
Explanation:
You can get the same result by using rank(), dense_rank() and simply use of group by and max() function.
--USING GROUP BY
select t.name, t.branch, t.obtained_mark from marks t
join
(select branch, max(obtained_mark) as max_marks from marks
group by branch) m?
where t.obtained_mark = m.max_marks and t.branch = m.branch;
--USING RANK()
select * from
(select *,
rank() over(partition by branch order by obtained_mark desc) as topper_rank
from marks) t?
where topper_rank = 1;
--USING DENSE_RANK()
select name, branch, obtained_mark, avg_marks from
(select *,
dense_rank() over(partition by branch order by obtained_mark desc) as topper_rank,
avg(obtained_mark) over(partition by branch) as avg_marks
from marks) as t
where topper_rank = 1;
-- USING ROW_NUMNER
select * from
(select *,
row_number() OVER(partition by branch order by obtained_mark desc) as topper_rank
from marks) as m
where topper_rank=1
?? Data Science & Business Analyst Intern @The Spark Foundation ?? Data Analyst | Excel | SQL | Tableau | Power BI | Python ?? | Snowflake?? |
1 年In SQL, a window function is a type of function that performs a calculation across a set of rows that are related to the current row. It allows you to perform calculations on a "window" of rows, which can be defined by a specific range or partition. Window functions are typically used to calculate aggregated values or perform calculations that require access to multiple rows within a query result set. They are especially useful when you need to calculate running totals, rankings, or cumulative sums. The syntax for using a window function is as follows: ------------------------------------------------------------------------------- SELECT column1, column2, ..., window_function() OVER (PARTITION BY columnX ORDER BY columnY) FROM table_name; -------------------------------------------------------------------------------
?? Data Science & Business Analyst Intern @The Spark Foundation ?? Data Analyst | Excel | SQL | Tableau | Power BI | Python ?? | Snowflake?? |
1 年SQL Window Functions Basics Before we dive into?ranking functions in SQL, it’s worth noting that all analytical functions share some basic rules: The processing of the analytical function is done?after?the?GROUP BY?processing of the SQL statement and?before?the ordering of the returned rows. Analytic function execution organizes results into partitions and then computes functions over these partitions in a specified order. Execution of the analytical function occurs?after row filtering?(WHERE)?and grouping?(GROUP BY) so the calculations can be?performed on the?GROUP BY?results.
?? Data Science & Business Analyst Intern @The Spark Foundation ?? Data Analyst | Excel | SQL | Tableau | Power BI | Python ?? | Snowflake?? |
1 年OVER(ORDER BY) Unlike the ordinary?ORDER BY?used commonly in a?SELECT?statement, the?ORDER BY?clause in window functions does not order the rows in the output. Instead,?ORDER BY?here means the order of the data over which the window function is executed. OVER(PARTITION BY) The?PARTITION BY?clause in window functions is used to define how rows of data are divided into groups. It’s similar to the?GROUP BY?clause, but?PARTITION BY?does not collapse the individual rows when grouping the data. If you want to use a?PARTITION BY?clause, it has to be part of an?OVER()?clause. OVER(PARTITION BY ORDER BY) "OVER(PARTITION BY ORDER BY)" is a powerful SQL clause that allows us to perform calculations within specific partitions of your data and order those partitions. It's a great tool for advanced data analysis!
?? Data Science & Business Analyst Intern @The Spark Foundation ?? Data Analyst | Excel | SQL | Tableau | Power BI | Python ?? | Snowflake?? |
1 年Aggregate Functions avg(expr)?- average value for rows within the window frame count(expr)?- count of values for rows within the window frame max(expr)?- maximum value within the window frame min(expr)?- minimum value within the window frame sum(expr)?- sum of values within the window frame ORDER BY and Window Frame:?Aggregate functions do not require an?ORDER BY. They accept window frame definition (ROWS,?RANGE,?GROUPS).
?? Data Science & Business Analyst Intern @The Spark Foundation ?? Data Analyst | Excel | SQL | Tableau | Power BI | Python ?? | Snowflake?? |
1 年ntile(n)