Aggregate and Window Functions in Pyspark

Aggregate Functions

These are functions where the number of output rows is always less than the number of input rows.

There are mainly two types of Aggregate Functions:

  • Simple Aggregate functions - These are functions which accept multiple input rows and produce only one output row, e.g. sum(), count(), countDistinct()
  • Grouping Aggregate functions - These are functions which accept multiple input rows belonging to a group (grouping is done on a column). For each group, there will be one output row, e.g. groupBy()

Simple Aggregate Function Example

Consider we have an orders.csv dataset and we are required to do the following (a sketch follows the list):

  • Count the total number of records
  • Count the number of distinct invoice IDs
  • Find the sum of quantities
  • Find the average unit price
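
A minimal sketch of these four aggregations, assuming orders.csv has columns named InvoiceNo, Quantity and UnitPrice (the column names are assumptions, not given in the original dataset description):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregates").getOrCreate()

# Read the dataset with a header row and inferred column types
orders_df = spark.read.csv("orders.csv", header=True, inferSchema=True)

orders_df.select(
    F.count("*").alias("total_records"),                      # total number of records
    F.countDistinct("InvoiceNo").alias("distinct_invoices"),  # distinct invoice IDs
    F.sum("Quantity").alias("total_quantity"),                # sum of quantities
    F.avg("UnitPrice").alias("avg_unit_price"),               # average unit price
).show()

All four aggregates collapse the whole DataFrame into a single output row, which is exactly the simple-aggregate behaviour described above.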

Grouping Aggregate Function Example

Consider we have an orders.csv dataset and we are required to group based on invoice number and country. We also need to do the following (a sketch follows the list):

  • Find the total quantity for each group
  • Find the total invoice amount for each group (amount = quantity * unitprice)
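
A minimal sketch of the grouped aggregation, reusing the hypothetical orders_df and column names from the previous sketch:

from pyspark.sql import functions as F

grouped_df = (
    orders_df
    .groupBy("InvoiceNo", "Country")  # one output row per (invoice, country) group
    .agg(
        F.sum("Quantity").alias("total_quantity"),
        # invoice amount = quantity * unit price, summed per group
        F.sum(F.col("Quantity") * F.col("UnitPrice")).alias("invoice_amount"),
    )
)
grouped_df.show()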

Window Functions

  • These are functions where the output is generated by performing operations on a predefined set of rows within a window, e.g. partitionBy(), orderBy(), rowsBetween()

Window Function Example

Consider we have window.csv. We need to define the following three parameters:

  • Partition Column - Partition based on country
  • Sorting Column - Sort based on week number
  • Window Size - Define the size by mentioning the start row and end row

Find the running total of the invoice value.
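
A minimal sketch of the three window parameters and the running total, assuming window.csv has columns named Country, WeekNum and InvoiceValue (hypothetical names):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("windows").getOrCreate()
windows_df = spark.read.csv("window.csv", header=True, inferSchema=True)

running_window = (
    Window.partitionBy("Country")                                     # partition column
          .orderBy("WeekNum")                                         # sorting column
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)  # window size
)

# Running total: sum of InvoiceValue from the first row of the
# partition up to and including the current row
windows_df.withColumn(
    "running_total",
    F.sum("InvoiceValue").over(running_window),
).show()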

Understanding rank(), dense_rank(), row_number(), lead() and lag()

rank() - Returns the rank of rows within a window partition.

In rank(), some ranks are skipped if there are ties: tied rows share the same rank, and a gap follows them.

dense_rank() - Returns the rank of rows within a window partition without any gaps.

In dense_rank(), the ranks are not skipped even if there are ties.

row_number() - Returns a sequential number starting at 1 within a window partition.

In row_number(), different row numbers are assigned even in the case of a tie. It plays an important role in calculating top-n results. A comparison of the three functions is sketched below.
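
A minimal sketch comparing the three functions on a small, made-up marks dataset:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ranks").getOrCreate()

marks_df = spark.createDataFrame(
    [("A", 100), ("B", 100), ("C", 99), ("D", 98)],
    ["student", "marks"],
)

# A single global window is enough for this illustration
w = Window.orderBy(F.desc("marks"))

marks_df.select(
    "student",
    "marks",
    F.rank().over(w).alias("rank"),              # 1, 1, 3, 4 - gap after the tie
    F.dense_rank().over(w).alias("dense_rank"),  # 1, 1, 2, 3 - no gaps
    F.row_number().over(w).alias("row_number"),  # 1, 2, 3, 4 - always unique
).show()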

Real-world example of rank(), dense_rank() and row_number()

Consider an example scenario of how different ranks are assigned to the students based on the marks scored.

  1. Consider an entrance exam for Engineering students where we have only 100 seats to be distributed among the students. Tied students share a rank, and the subsequent ranks are skipped during the selection process. Here, we need to use rank().
  2. Consider we need to distribute the gold, silver and bronze medals among the participants where,

2 people got 100 points - 1st rank
1 person got 99 points - 2nd rank
1 person got 98 points - 3rd rank

Here, we need to use dense_rank()

When we need to compare two rows, the lead() and lag() functions should be used.

lead() - It is used when the current row needs to be compared with the next row.

lag() - It is used when the current row needs to be compared with the previous row.
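
A minimal sketch of lead() and lag(), reusing the hypothetical windows_df and column names from the running-total sketch:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("Country").orderBy("WeekNum")

windows_df.select(
    "Country",
    "WeekNum",
    "InvoiceValue",
    F.lag("InvoiceValue", 1).over(w).alias("prev_week_value"),   # value from the previous row
    F.lead("InvoiceValue", 1).over(w).alias("next_week_value"),  # value from the next row
).withColumn(
    # Compare the current row with the previous one
    "change_vs_prev_week",
    F.col("InvoiceValue") - F.col("prev_week_value"),
).show()

For the first row of each partition lag() returns null, and for the last row lead() returns null.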

For more information, please check out:

Aggregate Functions

Window Functions

Credits - Sumit Mittal sir
