Windowing Functions

Windowing functions in PySpark and Spark SQL provide powerful ways to perform calculations against a group, or 'window', of rows related to the current row. They are essential tools for data analysis, particularly when working with ordered datasets. In this blog post, we'll explore five key windowing functions: rank, dense_rank, row_number, lead, and lag.

Let's consider a dataset of sales data with the following columns: Salesperson, Product, and Sales.

| Salesperson | Product | Sales |
|-------------|---------|-------|
| Alice       | Apples  | 100   |
| Alice       | Apples  | 100   |
| Alice       | Apples  | 150   |
| Alice       | Apples  | 200   |
| Alice       | Apples  | 200   |
| Alice       | Apples  | 200   |
| Alice       | Apples  | 300   |

In this dataset, Alice sells only Apples, and several sales amounts repeat. These ties will help us demonstrate the difference between rank, dense_rank, and row_number.

Rank, Dense Rank, and Row Number

The rank function assigns a rank to each row within a window partition, giving rows with equal values the same rank. When there is a tie for a position, the subsequent ranks are skipped.

The dense_rank function is similar to rank, but it does not skip ranks if there is a tie.

The row_number function assigns a unique row number to each row within a window partition, regardless of the values in the rows.

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank, row_number

windowSpec = Window.partitionBy("Salesperson", "Product").orderBy("Sales")

df.withColumn("rank", rank().over(windowSpec)) \
  .withColumn("dense_rank", dense_rank().over(windowSpec)) \
  .withColumn("row_number", row_number().over(windowSpec)) \
  .show()
```

This will output:

| Salesperson | Product | Sales | Rank | Dense Rank | Row Number |
|-------------|---------|-------|------|------------|------------|
| Alice       | Apples  | 100   | 1    | 1          | 1          |
| Alice       | Apples  | 100   | 1    | 1          | 2          |
| Alice       | Apples  | 150   | 3    | 2          | 3          |
| Alice       | Apples  | 200   | 4    | 3          | 4          |
| Alice       | Apples  | 200   | 4    | 3          | 5          |
| Alice       | Apples  | 200   | 4    | 3          | 6          |
| Alice       | Apples  | 300   | 7    | 4          | 7          |

In Spark SQL:

```sql
SELECT Salesperson, Product, Sales,
       rank() OVER (PARTITION BY Salesperson, Product ORDER BY Sales) AS rank,
       dense_rank() OVER (PARTITION BY Salesperson, Product ORDER BY Sales) AS dense_rank,
       row_number() OVER (PARTITION BY Salesperson, Product ORDER BY Sales) AS row_number
FROM sales;
```

Lead and Lag

The lead function returns the value of a given expression at a specified number of rows after the current row within a window partition. It can be used to compare the current sale with the next sale.

The lag function returns the value of a given expression at a specified number of rows before the current row within a window partition. It can be used to compare the current sale with the previous sale.

```python
from pyspark.sql.functions import lead, lag

windowSpec = Window.partitionBy("Salesperson", "Product").orderBy("Sales")

df.withColumn("next_sale", lead("Sales").over(windowSpec)) \
  .withColumn("prev_sale", lag("Sales").over(windowSpec)) \
  .show()
```

This will output:

| Salesperson | Product | Sales | Next Sale | Previous Sale |
|-------------|---------|-------|-----------|---------------|
| Alice       | Apples  | 100   | 100       | null          |
| Alice       | Apples  | 100   | 150       | 100           |
| Alice       | Apples  | 150   | 200       | 100           |
| Alice       | Apples  | 200   | 200       | 150           |
| Alice       | Apples  | 200   | 200       | 200           |
| Alice       | Apples  | 200   | 300       | 200           |
| Alice       | Apples  | 300   | null      | 200           |

In Spark SQL:

```sql
SELECT Salesperson, Product, Sales,
       lead(Sales) OVER (PARTITION BY Salesperson, Product ORDER BY Sales) AS next_sale,
       lag(Sales) OVER (PARTITION BY Salesperson, Product ORDER BY Sales) AS prev_sale
FROM sales;
```

In conclusion, windowing functions in PySpark and Spark SQL provide powerful ways to perform complex calculations on ordered datasets. By mastering these functions, you can greatly enhance your data analysis capabilities.

#ApacheSpark #DistributedProcessing #DataFrame #BigDataAnalytics #DataEngineering #DataProcessing #sparksql

