Window and windowing functions using DataFrame API in Spark

In this article, we will cover the following concepts:

  1. Window Class, and
  2. Window Functions

Such as:

  • rank
  • dense_rank
  • row_number

We will use the Spark DataFrame API to demonstrate these.

What is Window class?

The Window class in PySpark is used to define window specifications for window functions.

Specifications such as:

  1. partitionBy (to group the data and apply window functions on each partition)
  2. orderBy (to sort the rows within each partition)
  3. rowsBetween (to define the window frame size)

To demonstrate this, consider a DataFrame orders_df with columns:

country | weeknum | numinvoices | totalquantity | invoicevalue

Sample Data
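So that the snippets below are runnable, here is a minimal sketch that builds a hypothetical orders_df with the columns above (the values are illustrative, not the original dataset):

%python
# Hypothetical sample data (illustrative values only) matching the columns above
data = [
    ("India", 48, 25, 300, 3000.0),
    ("India", 49, 20, 250, 2500.0),
    ("India", 50, 22, 280, 2500.0),
    ("Germany", 48, 11, 150, 1800.0),
    ("Germany", 49, 12, 160, 1950.0),
]
columns = ["country", "weeknum", "numinvoices", "totalquantity", "invoicevalue"]
orders_df = spark.createDataFrame(data, columns)
orders_df.show()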

Window class Usage:

%python
from pyspark.sql import Window

myWindow = Window.partitionBy("country") \
    .orderBy("weeknum") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

Frame boundaries and available options:

A frame boundary defines the subset of rows within the partition to which the window function is applied. We have the following options (a running-total sketch follows the list):

  • Window.unboundedPreceding - refers to the first row in the partition.
  • Window.unboundedFollowing - refers to the last row in the partition.
  • Window.currentRow - refers to the current row.
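As a quick sketch of how the frame boundary matters: applying sum() over the window defined above (unboundedPreceding to currentRow) produces a running total of invoicevalue per country. The running_total column name is an illustrative choice:

%python
from pyspark.sql.functions import sum as sum_

# With the frame spanning the first row of the partition up to the current
# row, sum() accumulates into a running total within each country
running_df = orders_df.withColumn("running_total", sum_("invoicevalue").over(myWindow))
running_df.show()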

We have now covered the Window class, so we are ready to apply window functions.

What is a Window Function ?

Window functions in Apache Spark operate on a group of rows (referred to as a window) and calculate a return value for each row based on the group of rows.

Common Window Functions

  • rank()
  • dense_rank()
  • row_number()

rank() - skips ranks in case of a tie. Example usage: ranking candidates in a university entrance exam.

dense_rank() - doesn't skip ranks in case of a tie. Example usage: declaring winners at the Olympics.

row_number() - assigns a unique sequential number to each row within a partition.

NOTE: Apart from these ranking functions, you can also apply aggregate functions such as sum() and avg() over a window, based on the use case.
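For example, here is a minimal sketch of avg() over a window (avg_invoice is an illustrative column name):

%python
from pyspark.sql.functions import avg

# With only partitionBy (no ordering), the frame covers the whole partition,
# so every row in a country gets the same average invoicevalue
avgWindow = Window.partitionBy("country")
avg_df = orders_df.withColumn("avg_invoice", avg("invoicevalue").over(avgWindow))
avg_df.show()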

Usage of rank(), dense_rank() and row_number():

%python
from pyspark.sql import Window
from pyspark.sql.functions import desc, row_number, rank, dense_rank
# (alternatively: from pyspark.sql.functions import *)

myWindow = Window.partitionBy("country") \
    .orderBy(desc("invoicevalue"))

# rank() usage
results_df = orders_df.withColumn("rank", rank().over(myWindow))
results_df.show()

# Similar usage for dense_rank()
results_df = orders_df.withColumn("dense_rank", dense_rank().over(myWindow))
results_df.show()

# Similar usage for row_number()
results_df = orders_df.withColumn("row_number", row_number().over(myWindow))
results_df.show()

Table showing the difference between rank, dense_rank and row_number (the original table image is not reproduced here; an illustrative example follows):
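With hypothetical values, ordering by invoicevalue in descending order and a tie at 2500:

country | invoicevalue | rank | dense_rank | row_number
India | 3000 | 1 | 1 | 1
India | 2500 | 2 | 2 | 2
India | 2500 | 2 | 2 | 3
India | 1800 | 4 | 3 | 4

rank() skips to 4 after the tie, dense_rank() continues with 3, and row_number() stays unique regardless of ties.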

We have two more important window functions: lead() and lag().

lead() - returns the value from the next row within the partition, for comparison with the current row.

lag() - returns the value from the previous row within the partition, for comparison with the current row.

lag() Usage:

%python
from pyspark.sql import Window
from pyspark.sql.functions import lag, lead, expr

myWindow = Window.partitionBy("country") \
    .orderBy("weeknum")

# Pull the previous week's invoicevalue into the current row
results_df = orders_df.withColumn("previous_week", lag("invoicevalue").over(myWindow))

# Week-over-week difference
final_df = results_df.withColumn("invoice_diff", expr("invoicevalue - previous_week"))
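Similarly, a sketch of lead() on the same window (next_week is an illustrative column name):

%python
# Pull the following week's invoicevalue into the current row
results_df = orders_df.withColumn("next_week", lead("invoicevalue").over(myWindow))
results_df.show()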

Hope you now understand what a window is and how to apply window functions over one.



NOTE - This topic is very important if you are planning to appear for a data engineering interview, so understand it well.

All the best. : )
