Aggregate and Window Functions in Pyspark

Aggregate and Window Functions in Pyspark

Aggregate Functions

These are the functions where the number of output rows will always be less than the number of input rows.

There are mainly two types of Aggregate Functions:

  • Simple Aggregate functionsThese are the functions which accepts multiple input rows to give only one output roweg) sum(), count(), distinct()
  • Grouping Aggregate functionsThese are the functions which accepts multiple input rows belonging to a group (grouping is done on a column). For each group, there will be one output row.eg) groupBy()

Simple Aggregate Function Example

Consider we have an orders.csv dataset and we are required to

  • Count total number of records.
  • Count number of distinct invoice ids
  • Find sum of quantities
  • Find average unit price

Grouping Aggregate Function Example

Consider we have an orders.csv dataset and we are required to group based on invoice number and country. We also need to

  • Find total quantity for each group
  • Find the total invoice amount (amount = quantity * unitprice)

Window Functions

  • These are the functions where the output is generated by performing operations on a predefined rows set within a window.eg) partitionBy(), orderBy(), rowsBetween()

Window Function Example

Consider we have window.csv. We need to define the following 3 parameters

  • Partition Colum - Partition by based on country
  • Sorting Column - Sort based on week number
  • Window Size - Define the size by mentioning the start row and end row

Find the running total of invoice value

Understanding rank(), dense_rank(), row_number(), lead() and lag()

rank() - Returns the rank of rows within a window partition.

In rank(), some ranks can be skipped if there are clashes in the ranks.

dense_rank() - Returns the rank of rows within a window partition without any gaps.

In dense_rank(), the ranks are not skipped even if there are clashes in the ranks.

row_number() - Returns a sequential number starting at 1 within a window partition.

In row_number(), different row numbers are assigned even in case of a tie. It plays an important role in calculating the top-n results.

Realworld example of rank(), dense_rank() and row_number()

Consider an example scenario of how different ranks are assigned to the students based on the marks scored.

  1. Consider an entrance exam for Engineering students and we only have 100 seats to be distributed among the students. There should not be any clashes among the ranks of the students during the selection process. Here, we need to use rank()
  2. Consider we need to distribute the gold, silver and bronze medals among the participants where,

2 people got 100 points - 1st rank
1 person got   99 points - 2nd rank
1 person got   98 points - 3rd rank        

Here, we need to use dense_rank()

When we need to compare two rows, lead() and lag() function should be used.

lead() - It is used when the current row needs to be compared with the next row.

lag() - It is used when the current row needs to be compared with the previous row.

For more information please check out:

Aggregate Functions

Window Functions

Credits - Sumit Mittal sir

要查看或添加评论,请登录

社区洞察

其他会员也浏览了