Windowing Functions
Sachin D N
Windowing functions in PySpark and Spark SQL provide powerful ways to perform calculations against a group, or 'window', of rows related to the current row. They are essential tools for data analysis, particularly when working with ordered datasets. In this blog post, we'll explore five key windowing functions: rank, dense_rank, row_number, lead, and lag.
Let's consider a dataset of sales data with the following columns: Salesperson, Product, and Sales.
| Salesperson | Product | Sales |
|-----------|---------|-------|
| Alice | Apples | 100 |
| Alice | Apples | 100 |
| Alice | Apples | 150 |
| Alice | Apples | 200 |
| Alice | Apples | 200 |
| Alice | Apples | 200 |
| Alice | Apples | 300 |
In this dataset, Alice sells Apples, and several sales amounts are tied, which will help us demonstrate the difference between rank, dense_rank, and row_number.
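Before applying the window functions, here is one way to build this dataset as a DataFrame. This is a minimal sketch: the SparkSession variable spark and the DataFrame name df are assumptions carried through the rest of the post.

from pyspark.sql import SparkSession

# Assumed entry point; reuses an existing session if one is already running.
spark = SparkSession.builder.appName("windowing-demo").getOrCreate()

data = [
    ("Alice", "Apples", 100),
    ("Alice", "Apples", 100),
    ("Alice", "Apples", 150),
    ("Alice", "Apples", 200),
    ("Alice", "Apples", 200),
    ("Alice", "Apples", 200),
    ("Alice", "Apples", 300),
]
df = spark.createDataFrame(data, ["Salesperson", "Product", "Sales"])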
Rank, Dense Rank, and Row Number
The rank function assigns a rank to each row within a window partition, giving rows with equal values the same rank. If there is a tie for a position, the subsequent ranks are skipped.
The dense_rank function is similar to rank, but it does not skip ranks if there is a tie.
The row_number function assigns a unique row number to each row within a window partition, regardless of the values in the rows.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank, row_number

# One window per (Salesperson, Product) group, rows ordered by Sales.
windowSpec = Window.partitionBy("Salesperson", "Product").orderBy("Sales")

df.withColumn("rank", rank().over(windowSpec))\
  .withColumn("dense_rank", dense_rank().over(windowSpec))\
  .withColumn("row_number", row_number().over(windowSpec)).show()
This will output:
| Salesperson | Product | Sales | Rank | Dense Rank | Row Number |
|-----------|---------|-------|-----|--------|---------|
| Alice | Apples | 100 | 1 | 1 | 1 |
| Alice | Apples | 100 | 1 | 1 | 2 |
| Alice | Apples | 150 | 3 | 2 | 3 |
| Alice | Apples | 200 | 4 | 3 | 4 |
| Alice | Apples | 200 | 4 | 3 | 5 |
| Alice | Apples | 200 | 4 | 3 | 6 |
| Alice | Apples | 300 | 7 | 4 | 7 |
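One caveat: when several rows tie on Sales (rows 1-2 and 4-6 above), the row_number values assigned within the tie are not guaranteed to be stable across runs, because the ordering is not total. A minimal sketch of one fix, assuming a unique tiebreaker column (OrderID here is hypothetical; this sample dataset does not have one):

# Add a secondary sort key so row_number is deterministic within ties.
# "OrderID" is a hypothetical unique column, not part of the sample data.
windowSpec = Window.partitionBy("Salesperson", "Product").orderBy("Sales", "OrderID")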
In Spark SQL:
SELECT Salesperson, Product, Sales,
rank() OVER (PARTITION BY Salesperson, Product ORDER BY Sales) as rank,
dense_rank() OVER (PARTITION BY Salesperson, Product ORDER BY Sales) as dense_rank,
row_number() OVER (PARTITION BY Salesperson, Product ORDER BY Sales) as row_number
FROM sales;
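To run this query from PySpark, the DataFrame first needs to be registered as a temporary view under the name the query references. A brief sketch (showing only rank; the other two functions follow the same pattern):

# Expose the DataFrame to Spark SQL as the "sales" table used above.
df.createOrReplaceTempView("sales")

spark.sql("""
    SELECT Salesperson, Product, Sales,
           rank() OVER (PARTITION BY Salesperson, Product ORDER BY Sales) AS rank
    FROM sales
""").show()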
Lead and Lag
The lead function returns the value of a given expression a specified number of rows (one by default) after the current row within a window partition. It can be used to compare the current sale with the next one.
The lag function returns the value of a given expression a specified number of rows (one by default) before the current row within a window partition. It can be used to compare the current sale with the previous one.
from pyspark.sql.functions import lead, lag

# Same window as before: one frame per (Salesperson, Product), ordered by Sales.
windowSpec = Window.partitionBy("Salesperson", "Product").orderBy("Sales")

df.withColumn("next_sale", lead("Sales").over(windowSpec))\
  .withColumn("prev_sale", lag("Sales").over(windowSpec)).show()
This will output:
| Salesperson | Product | Sales | Next Sale | Previous Sale |
|-------------|---------|-------|-----------|---------------|
| Alice | Apples | 100 | 100 | null |
| Alice | Apples | 100 | 150 | 100 |
| Alice | Apples | 150 | 200 | 100 |
| Alice | Apples | 200 | 200 | 150 |
| Alice | Apples | 200 | 200 | 200 |
| Alice | Apples | 200 | 300 | 200 |
| Alice | Apples | 300 | null | 200 |
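A common follow-up is to compute the change from one sale to the next by subtracting the lagged value from the current one. A minimal sketch (the column name sale_change is an assumption):

from pyspark.sql.functions import col, lag

# The first row in each partition has no predecessor, so prev_sale (and
# therefore sale_change) is null there.
df.withColumn("prev_sale", lag("Sales").over(windowSpec))\
  .withColumn("sale_change", col("Sales") - col("prev_sale")).show()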
In Spark SQL:
SELECT Salesperson, Product, Sales,
lead(Sales) OVER (PARTITION BY Salesperson, Product ORDER BY Sales) as next_sale,
lag(Sales) OVER (PARTITION BY Salesperson, Product ORDER BY Sales) as prev_sale
FROM sales;
In conclusion, windowing functions in PySpark and Spark SQL provide powerful ways to perform complex calculations on ordered datasets. By mastering these functions, you can greatly enhance your data analysis capabilities.
#ApacheSpark #DistributedProcessing #DataFrame #BigDataAnalytics #DataEngineering #DataProcessing #sparksql