Windowing Functions
Sachin D N
Windowing functions in PySpark and Spark SQL provide powerful ways to perform calculations against a group, or 'window', of rows related to the current row. They are essential tools for data analysis, particularly when working with ordered datasets. In this blog post, we'll explore five key windowing functions: rank, dense_rank, row_number, lead, and lag.
Let's consider a dataset of sales data with the following columns: Salesperson, Product, and Sales.
| Salesperson | Product | Sales |
|-----------|---------|-------|
| Alice | Apples | 100 |
| Alice | Apples | 100 |
| Alice | Apples | 150 |
| Alice | Apples | 200 |
| Alice | Apples | 200 |
| Alice | Apples | 200 |
| Alice | Apples | 300 |
In this dataset, Alice sells Apples, and several sales amounts are tied, which will help us demonstrate the difference between rank, dense_rank, and row_number.
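Before applying the window functions, here is one way to build this dataset as a DataFrame. This is a minimal sketch: the SparkSession variable spark and the DataFrame name df are assumptions carried through the rest of the post.

from pyspark.sql import SparkSession

# Assumed entry point; reuses an existing session if one is already running.
spark = SparkSession.builder.appName("windowing-demo").getOrCreate()

data = [
    ("Alice", "Apples", 100),
    ("Alice", "Apples", 100),
    ("Alice", "Apples", 150),
    ("Alice", "Apples", 200),
    ("Alice", "Apples", 200),
    ("Alice", "Apples", 200),
    ("Alice", "Apples", 300),
]
df = spark.createDataFrame(data, ["Salesperson", "Product", "Sales"])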
Rank, Dense Rank, and Row Number
The rank function assigns a rank to each row within a window partition, giving rows with equal values the same rank. If there is a tie for a position, the subsequent ranks are skipped.
The dense_rank function is similar to rank, but it does not skip ranks if there is a tie.
The row_number function assigns a unique row number to each row within a window partition, regardless of the values in the rows.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank, row_number

# One window per (Salesperson, Product) group, rows ordered by Sales.
windowSpec = Window.partitionBy("Salesperson", "Product").orderBy("Sales")

df.withColumn("rank", rank().over(windowSpec))\
  .withColumn("dense_rank", dense_rank().over(windowSpec))\
  .withColumn("row_number", row_number().over(windowSpec)).show()
This will output:
| Salesperson | Product | Sales | Rank | Dense Rank | Row Number |
|-----------|---------|-------|-----|--------|---------|
| Alice | Apples | 100 | 1 | 1 | 1 |
| Alice | Apples | 100 | 1 | 1 | 2 |
| Alice | Apples | 150 | 3 | 2 | 3 |
| Alice | Apples | 200 | 4 | 3 | 4 |
| Alice | Apples | 200 | 4 | 3 | 5 |
| Alice | Apples | 200 | 4 | 3 | 6 |
| Alice | Apples | 300 | 7 | 4 | 7 |
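One caveat: when several rows tie on Sales (rows 1-2 and 4-6 above), the row_number values assigned within the tie are not guaranteed to be stable across runs, because the ordering is not total. A minimal sketch of one fix, assuming a unique tiebreaker column (OrderID here is hypothetical; this sample dataset does not have one):

# Add a secondary sort key so row_number is deterministic within ties.
# "OrderID" is a hypothetical unique column, not part of the sample data.
windowSpec = Window.partitionBy("Salesperson", "Product").orderBy("Sales", "OrderID")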
In Spark SQL:
SELECT Salesperson, Product, Sales,
rank() OVER (PARTITION BY Salesperson, Product ORDER BY Sales) as rank,
dense_rank() OVER (PARTITION BY Salesperson, Product ORDER BY Sales) as dense_rank,
row_number() OVER (PARTITION BY Salesperson, Product ORDER BY Sales) as row_number
FROM sales;
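To run this query from PySpark, the DataFrame first needs to be registered as a temporary view under the name the query references. A brief sketch (showing only rank; the other two functions follow the same pattern):

# Expose the DataFrame to Spark SQL as the "sales" table used above.
df.createOrReplaceTempView("sales")

spark.sql("""
    SELECT Salesperson, Product, Sales,
           rank() OVER (PARTITION BY Salesperson, Product ORDER BY Sales) AS rank
    FROM sales
""").show()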
Lead and Lag
The lead function returns the value of a given expression a specified number of rows (one by default) after the current row within a window partition. It can be used to compare the current sale with the next one.
The lag function returns the value of a given expression a specified number of rows (one by default) before the current row within a window partition. It can be used to compare the current sale with the previous one.
from pyspark.sql.functions import lead, lag

# Same window as before: one frame per (Salesperson, Product), ordered by Sales.
windowSpec = Window.partitionBy("Salesperson", "Product").orderBy("Sales")

df.withColumn("next_sale", lead("Sales").over(windowSpec))\
  .withColumn("prev_sale", lag("Sales").over(windowSpec)).show()
This will output:
| Salesperson | Product | Sales | Next Sale | Previous Sale |
|-------------|---------|-------|-----------|---------------|
| Alice | Apples | 100 | 100 | null |
| Alice | Apples | 100 | 150 | 100 |
| Alice | Apples | 150 | 200 | 100 |
| Alice | Apples | 200 | 200 | 150 |
| Alice | Apples | 200 | 200 | 200 |
| Alice | Apples | 200 | 300 | 200 |
| Alice | Apples | 300 | null | 200 |
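A common follow-up is to compute the change from one sale to the next by subtracting the lagged value from the current one. A minimal sketch (the column name sale_change is an assumption):

from pyspark.sql.functions import col, lag

# The first row in each partition has no predecessor, so prev_sale (and
# therefore sale_change) is null there.
df.withColumn("prev_sale", lag("Sales").over(windowSpec))\
  .withColumn("sale_change", col("Sales") - col("prev_sale")).show()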
In Spark SQL:
SELECT Salesperson, Product, Sales,
lead(Sales) OVER (PARTITION BY Salesperson, Product ORDER BY Sales) as next_sale,
lag(Sales) OVER (PARTITION BY Salesperson, Product ORDER BY Sales) as prev_sale
FROM sales;
In conclusion, windowing functions in PySpark and Spark SQL provide powerful ways to perform complex calculations on ordered datasets. By mastering these functions, you can greatly enhance your data analysis capabilities.
#ApacheSpark #DistributedProcessing #DataFrame #BigDataAnalytics #DataEngineering #DataProcessing #sparksql