Aggregate and Window Functions in Pyspark

Aggregate Functions

These are functions where the number of output rows is always less than the number of input rows.

There are mainly two types of Aggregate Functions:

  • Simple Aggregate functions - These are functions which accept multiple input rows and produce only one output row, e.g. sum(), count(), countDistinct()
  • Grouping Aggregate functions - These are functions which accept multiple input rows belonging to a group (grouping is done on a column). For each group, there will be one output row, e.g. groupBy()

Simple Aggregate Function Example

Consider we have an orders.csv dataset and we are required to do the following (a sketch follows the list):

  • Count the total number of records
  • Count the number of distinct invoice IDs
  • Find the sum of quantities
  • Find the average unit price
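
A minimal sketch of these four aggregations, assuming orders.csv has columns named InvoiceNo, Quantity and UnitPrice (the column names are assumptions, not given in the original dataset description):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregates").getOrCreate()

# Read the dataset with a header row and inferred column types
orders_df = spark.read.csv("orders.csv", header=True, inferSchema=True)

orders_df.select(
    F.count("*").alias("total_records"),                      # total number of records
    F.countDistinct("InvoiceNo").alias("distinct_invoices"),  # distinct invoice IDs
    F.sum("Quantity").alias("total_quantity"),                # sum of quantities
    F.avg("UnitPrice").alias("avg_unit_price"),               # average unit price
).show()

All four aggregates collapse the whole DataFrame into a single output row, which is exactly the simple-aggregate behaviour described above.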

Grouping Aggregate Function Example

Consider we have an orders.csv dataset and we are required to group based on invoice number and country. We also need to do the following (a sketch follows the list):

  • Find the total quantity for each group
  • Find the total invoice amount for each group (amount = quantity * unitprice)
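
A minimal sketch of the grouped aggregation, reusing the hypothetical orders_df and column names from the previous sketch:

from pyspark.sql import functions as F

grouped_df = (
    orders_df
    .groupBy("InvoiceNo", "Country")  # one output row per (invoice, country) group
    .agg(
        F.sum("Quantity").alias("total_quantity"),
        # invoice amount = quantity * unit price, summed per group
        F.sum(F.col("Quantity") * F.col("UnitPrice")).alias("invoice_amount"),
    )
)
grouped_df.show()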

Window Functions

  • These are functions where the output is generated by performing operations on a predefined set of rows within a window, e.g. partitionBy(), orderBy(), rowsBetween()

Window Function Example

Consider we have window.csv. We need to define the following three parameters:

  • Partition Column - Partition based on country
  • Sorting Column - Sort based on week number
  • Window Size - Define the size by mentioning the start row and end row

Find the running total of the invoice value.
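
A minimal sketch of the three window parameters and the running total, assuming window.csv has columns named Country, WeekNum and InvoiceValue (hypothetical names):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("windows").getOrCreate()
windows_df = spark.read.csv("window.csv", header=True, inferSchema=True)

running_window = (
    Window.partitionBy("Country")                                     # partition column
          .orderBy("WeekNum")                                         # sorting column
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)  # window size
)

# Running total: sum of InvoiceValue from the first row of the
# partition up to and including the current row
windows_df.withColumn(
    "running_total",
    F.sum("InvoiceValue").over(running_window),
).show()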

Understanding rank(), dense_rank(), row_number(), lead() and lag()

rank() - Returns the rank of rows within a window partition.

In rank(), some ranks are skipped if there are ties: tied rows share the same rank, and a gap follows them.

dense_rank() - Returns the rank of rows within a window partition without any gaps.

In dense_rank(), the ranks are not skipped even if there are ties.

row_number() - Returns a sequential number starting at 1 within a window partition.

In row_number(), different row numbers are assigned even in the case of a tie. It plays an important role in calculating top-n results. A comparison of the three functions is sketched below.
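
A minimal sketch comparing the three functions on a small, made-up marks dataset:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ranks").getOrCreate()

marks_df = spark.createDataFrame(
    [("A", 100), ("B", 100), ("C", 99), ("D", 98)],
    ["student", "marks"],
)

# A single global window is enough for this illustration
w = Window.orderBy(F.desc("marks"))

marks_df.select(
    "student",
    "marks",
    F.rank().over(w).alias("rank"),              # 1, 1, 3, 4 - gap after the tie
    F.dense_rank().over(w).alias("dense_rank"),  # 1, 1, 2, 3 - no gaps
    F.row_number().over(w).alias("row_number"),  # 1, 2, 3, 4 - always unique
).show()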

Real-world example of rank(), dense_rank() and row_number()

Consider an example scenario of how different ranks are assigned to the students based on the marks scored.

  1. Consider an entrance exam for Engineering students where we have only 100 seats to be distributed among the students. Tied students share a rank, and the subsequent ranks are skipped during the selection process. Here, we need to use rank().
  2. Consider we need to distribute the gold, silver and bronze medals among the participants where,

2 people got 100 points - 1st rank
1 person got 99 points - 2nd rank
1 person got 98 points - 3rd rank

Here, we need to use dense_rank()

When we need to compare two rows, the lead() and lag() functions should be used.

lead() - It is used when the current row needs to be compared with the next row.

lag() - It is used when the current row needs to be compared with the previous row.
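
A minimal sketch of lead() and lag(), reusing the hypothetical windows_df and column names from the running-total sketch:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("Country").orderBy("WeekNum")

windows_df.select(
    "Country",
    "WeekNum",
    "InvoiceValue",
    F.lag("InvoiceValue", 1).over(w).alias("prev_week_value"),   # value from the previous row
    F.lead("InvoiceValue", 1).over(w).alias("next_week_value"),  # value from the next row
).withColumn(
    # Compare the current row with the previous one
    "change_vs_prev_week",
    F.col("InvoiceValue") - F.col("prev_week_value"),
).show()

For the first row of each partition lag() returns null, and for the last row lead() returns null.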

For more information, please check out:

Aggregate Functions

Window Functions

Credits - Sumit Mittal sir
