Aggregation Functions in PySpark

Apache Spark is a powerful open-source processing engine for big data built around speed, ease of use, and sophisticated analytics. PySpark, the Python library for Spark, allows you to harness the simplicity of Python and the power of Apache Spark to tame Big Data. In this blog post, we will explore the use of aggregation functions in PySpark.

Simple Aggregations

Aggregation functions combine multiple input rows to provide a consolidated output. This is a common operation in data analysis, especially when dealing with large datasets. For instance, you might want to count the total number of records, count the number of distinct items, sum a particular column, or calculate the average of a column. Here are three ways to perform these operations in PySpark:
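All three styles below assume a SparkSession named spark and a DataFrame df with columns column1, column2, and column3. As a minimal, purely illustrative setup (the column names and sample values are placeholders, not from a real dataset), such a DataFrame could be created like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aggregation-demo").getOrCreate()

# Hypothetical sample data: column1 is a category, column2 and column3 are numeric.
df = spark.createDataFrame(
    [
        ("A", 10, 1.5),
        ("A", 20, 2.5),
        ("B", 30, 3.5),
        ("B", 40, 4.5),
        ("C", 50, 5.5),
    ],
    ["column1", "column2", "column3"],
)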

1. Programmatic Style

In the programmatic style, we use PySpark's built-in functions such as count, countDistinct, sum, and avg to perform the aggregations. The aggregate expressions are passed to the select method, and alias renames each output column.

from pyspark.sql.functions import count, countDistinct, sum, avg

df.select(
    count("*").alias("row_count"),
    countDistinct("column1").alias("unique_items"),
    sum("column2").alias("total_sum"),
    avg("column3").alias("avg_value")
).show()

2. Column Expression Style

In the column expression style, we use the selectExpr method, which allows us to write the expressions as strings. This can be more convenient if you're used to writing SQL queries. The expressions are written as they would be in a SQL query.

df.selectExpr(
    "count(*) as row_count",
    "count(distinct column1) as unique_items",
    "sum(column2) as total_sum",
    "avg(column3) as avg_value"
).show()

3. Spark SQL Style

In the Spark SQL style, we write the entire query as a SQL statement, which can be the most familiar and convenient approach for those comfortable with SQL. The query is passed as a string to the spark.sql method. Note that the DataFrame must first be registered as a temporary view (here named table) with createOrReplaceTempView before the query can reference it.

df.createOrReplaceTempView("table")

spark.sql("""
    SELECT
        count(*) as row_count,
        count(distinct column1) as unique_items,
        sum(column2) as total_sum,
        avg(column3) as avg_value
    FROM table
""").show()

Grouping Aggregations

Grouping aggregations are used when you want to group data based on certain columns and perform aggregations on each group. This is a common operation when you want to summarize data by category. For instance, you might want to group data based on a particular column and find the total and average of another column for each group.

1. Programmatic Style

In the programmatic style, we use the groupBy method to specify the columns to group by, and the agg method to specify the aggregation functions. The sort method is used to sort the output based on a column.

from pyspark.sql.functions import sum, avg

summary_df = df \
    .groupBy("column1") \
    .agg(
        sum("column2").alias("total_sum"),
        avg("column3").alias("avg_value")
    ).sort("column1")

2. Column Expression Style

In the column expression style, we use the groupBy method to specify the columns to group by, and the agg method with string expressions to specify the aggregation functions. The expr function is used to parse the string expressions.

from pyspark.sql.functions import expr

summary_df = df \
    .groupBy("column1") \
    .agg(
        expr("sum(column2) as total_sum"),
        expr("avg(column3) as avg_value")
    ).sort("column1")

3. Spark SQL Style

In the Spark SQL style, we write the entire query as a SQL statement, including the GROUP BY clause. The SQL query is passed as a string to the spark.sql method.

df.createOrReplaceTempView("table")

spark.sql("""
    SELECT
        column1,
        sum(column2) as total_sum,
        avg(column3) as avg_value
    FROM table
    GROUP BY column1
    ORDER BY column1
""").show()

Windowing Aggregations

Windowing aggregations are a type of aggregation where the function is applied to a frame or 'window' of rows. The window specification defines how to group rows into frames and how to order the rows in the frame. This is useful when you want to calculate running totals or moving averages.

from pyspark.sql.window import Window
from pyspark.sql.functions import sum

windowSpec = Window.partitionBy("column1").orderBy("column2").rowsBetween(-1, 1)

df.withColumn("rolling_sum", sum("column3").over(windowSpec)).show()

In this example, the window covers the current row plus one row before and one row after it (three rows in total), and we compute the sum of column3 over that frame to get a rolling sum. The partitionBy method specifies the column to partition the data by, orderBy defines the ordering within each partition, and rowsBetween sets the range of rows included in each frame.
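To get the running totals and moving averages mentioned above, only the frame boundaries and the aggregate function change. Here is a minimal sketch, reusing the same hypothetical column names as the earlier examples:

from pyspark.sql.window import Window
from pyspark.sql.functions import sum, avg

# Running total: every row from the start of the partition up to the current row.
running_window = Window.partitionBy("column1").orderBy("column2") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Moving average: the current row and the two rows before it.
moving_window = Window.partitionBy("column1").orderBy("column2").rowsBetween(-2, 0)

df \
    .withColumn("running_total", sum("column3").over(running_window)) \
    .withColumn("moving_avg", avg("column3").over(moving_window)) \
    .show()

The column names running_total and moving_avg are just illustrative; any frame expressible with rowsBetween (or rangeBetween) works the same way.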

In conclusion, PySpark provides a variety of ways to perform aggregations, each with its own advantages. By understanding these different methods, you can write more efficient and readable PySpark code.

#ApacheSpark #DistributedProcessing #DataFrame #BigDataAnalytics #DataEngineering #DataProcessing #sparksql #sparkaggregations
