登录查看更多内容

Aggregate and Window Functions in Pyspark

NIKHIL G R

Serving Notice Period, Cloud Data Engineer at TCS, 2x Microsoft Azure Cloud Certified, Python, Pyspark, Azure Databricks, ADLs, Azure Synapse, Azure Data factory, MySQL, Lake House, Delta Lake, Data Enthusiast

发布日期: 2023年12月4日

+ 关注

Aggregate Functions

These are the functions where the number of output rows will always be less than the number of input rows.

There are mainly two types of Aggregate Functions:

Simple Aggregate functionsThese are the functions which accepts multiple input rows to give only one output roweg) sum(), count(), distinct()
Grouping Aggregate functionsThese are the functions which accepts multiple input rows belonging to a group (grouping is done on a column). For each group, there will be one output row.eg) groupBy()

Simple Aggregate Function Example

Consider we have an orders.csv dataset and we are required to

Count total number of records.
Count number of distinct invoice ids
Find sum of quantities
Find average unit price

Grouping Aggregate Function Example

Consider we have an orders.csv dataset and we are required to group based on invoice number and country. We also need to

Find total quantity for each group
Find the total invoice amount (amount = quantity * unitprice)

Window Functions

These are the functions where the output is generated by performing operations on a predefined rows set within a window.eg) partitionBy(), orderBy(), rowsBetween()

Window Function Example

Consider we have window.csv. We need to define the following 3 parameters

Partition Colum - Partition by based on country
Sorting Column - Sort based on week number
Window Size - Define the size by mentioning the start row and end row

Find the running total of invoice value

Understanding rank(), dense_rank(), row_number(), lead() and lag()

rank() - Returns the rank of rows within a window partition.

Sachin D N ???? 10 个月前

Best Ways to Use Pandas with PySpark

Deepanshu Singh 1 个月前

PySpark

Mansi Mishra 4 年前

In rank(), some ranks can be skipped if there are clashes in the ranks.

dense_rank() - Returns the rank of rows within a window partition without any gaps.

In dense_rank(), the ranks are not skipped even if there are clashes in the ranks.

row_number() - Returns a sequential number starting at 1 within a window partition.

In row_number(), different row numbers are assigned even in case of a tie. It plays an important role in calculating the top-n results.

Realworld example of rank(), dense_rank() and row_number()

Consider an example scenario of how different ranks are assigned to the students based on the marks scored.

Consider an entrance exam for Engineering students and we only have 100 seats to be distributed among the students. There should not be any clashes among the ranks of the students during the selection process. Here, we need to use rank()
Consider we need to distribute the gold, silver and bronze medals among the participants where,

2 people got 100 points - 1st rank
1 person got   99 points - 2nd rank
1 person got   98 points - 3rd rank

Here, we need to use dense_rank()

When we need to compare two rows, lead() and lag() function should be used.

lead() - It is used when the current row needs to be compared with the next row.

lag() - It is used when the current row needs to be compared with the previous row.

For more information please check out:

Aggregate Functions

Window Functions

Credits - Sumit Mittal sir

Aggregate and Window Functions in Pyspark

NIKHIL G R

Serving Notice Period, Cloud Data Engineer at TCS, 2x Microsoft Azure Cloud Certified, Python, Pyspark, Azure Databricks, ADLs, Azure Synapse, Azure Data factory, MySQL, Lake House, Delta Lake, Data Enthusiast

Aggregate Functions

Simple Aggregate Function Example

Grouping Aggregate Function Example

Window Functions

Window Function Example

Understanding rank(), dense_rank(), row_number(), lead() and lag()

领英推荐

Realworld example of rank(), dense_rank() and row_number()

更多精彩文章

社区洞察

其他会员也浏览了

PySpark

5 Books Every Data Professional Should?Read

Mastering Big Data Analysis with Python's Pandas: Unleash the Power of Scalable Data Processing

How to optimize Pyspark Codes for better efficiency.

End to End Pyspark Example

Topic: Enhancing Performance in PySpark with Vectorized Operations: pandas_udf vs Standard UDF....

???? #DataScience Insight: The Significance of Data Cleaning ????

PySpark

Troubleshooting executor out of memory error in Pyspark

Simple ways to improve your PySpark and Parquet pipeline performance

Aggregate Functions

Simple Aggregate Function Example

Grouping Aggregate Function Example

Window Functions

Window Function Example

Understanding rank(), dense_rank(), row_number(), lead() and lag()

领英推荐

Realworld example of rank(), dense_rank() and row_number()

Introduction to DBT (Data Build Tool)

2024年5月20日

DIFFERENCES IN SQL

2024年1月8日

Introduction to Azure Databricks (Part 2)

2023年12月6日

Introduction to Azure Databricks (Part 1)

2023年12月5日

Different ways of creating a Dataframe in Pyspark

2023年11月24日

Dataframes and Spark SQL Table

2023年11月23日

Dataframe Reader API

2023年11月22日

repartition vs coalesce in pyspark

2023年11月21日

Apache Spark on YARN Architecture

2023年11月16日

Introduction to Apache spark

2023年11月16日

社区洞察

其他会员也浏览了

PySpark

5 Books Every Data Professional Should?Read

Mastering Big Data Analysis with Python's Pandas: Unleash the Power of Scalable Data Processing

How to optimize Pyspark Codes for better efficiency.

End to End Pyspark Example

Topic: Enhancing Performance in PySpark with Vectorized Operations: pandas_udf vs Standard UDF....

???? #DataScience Insight: The Significance of Data Cleaning ????

PySpark

Troubleshooting executor out of memory error in Pyspark

Simple ways to improve your PySpark and Parquet pipeline performance