Pivot and Unpivot in Spark

How to do pivot and unpivot in PySpark?

Example 1: Pivot

Let's say you have a DataFrame with the following data:


+-----+-----+-------+
|name |month|revenue|
+-----+-----+-------+
|Alice|Jan  |1000   |
|Bob  |Jan  |2000   |
|Alice|Feb  |1500   |
|Bob  |Feb  |2500   |
+-----+-----+-------+

And you want to pivot the data by month to get the following result:


+-----+----+----+
|name |Jan |Feb |
+-----+----+----+
|Alice|1000|1500|
|Bob  |2000|2500|
+-----+----+----+

You can do this using the pivot() function on a grouped DataFrame as follows:


from pyspark.sql.functions import sum
from pyspark.sql import SparkSession

# create SparkSession
spark = SparkSession.builder.appName("pivot_example").getOrCreate()

# create DataFrame
data = [("Alice", "Jan", 1000),
        ("Bob", "Jan", 2000),
        ("Alice", "Feb", 1500),
        ("Bob", "Feb", 2500)]
df = spark.createDataFrame(data, ["name", "month", "revenue"])

# pivot the data: pivot() is called on grouped data, so group by name first
pivoted_df = df.groupBy("name").pivot("month").agg(sum("revenue"))

# show the pivoted data
pivoted_df.show()

+-----+----+----+
| name| Jan| Feb|
+-----+----+----+
|Alice|1000|1500|
|  Bob|2000|2500|
+-----+----+----+
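To see what the groupBy/pivot/sum chain is doing conceptually, here is a plain-Python sketch (no Spark required) that mirrors the same logic on the same data; the variable names are illustrative only:

```python
# The same rows as the DataFrame above
rows = [("Alice", "Jan", 1000),
        ("Bob", "Jan", 2000),
        ("Alice", "Feb", 1500),
        ("Bob", "Feb", 2500)]

# Group by name and spread each month into its own "column" (dict key),
# accumulating revenue like agg(sum(...)) would if a (name, month) pair repeated.
pivoted = {}
for name, month, revenue in rows:
    pivoted.setdefault(name, {})[month] = pivoted.get(name, {}).get(month, 0) + revenue

print(pivoted)
# {'Alice': {'Jan': 1000, 'Feb': 1500}, 'Bob': {'Jan': 2000, 'Feb': 2500}}
```

Each outer key corresponds to one output row, and each inner key becomes one pivoted column, which is exactly the shape of the table above.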


        

Example 2: Unpivot

Let's say you have a DataFrame with the following data:


+-----+----+----+
| name| Jan| Feb|
+-----+----+----+
|Alice|1000|1500|
|  Bob|2000|2500|
+-----+----+----+
        

And you want to unpivot it back into the following result:

+-----+-----+-------+
|name |month|revenue|
+-----+-----+-------+
|Alice|Jan  |1000   |
|Alice|Feb  |1500   |
|Bob  |Jan  |2000   |
|Bob  |Feb  |2500   |
+-----+-----+-------+

You can do this using the stack() SQL function inside selectExpr() as follows:


from pyspark.sql import SparkSession


# create SparkSession
spark = SparkSession.builder.appName("unpivot_example").getOrCreate()


# create DataFrame
data = [("Alice", 1000, 1500),
        ("Bob", 2000, 2500)]
df = spark.createDataFrame(data, ["name", "Jan", "Feb"])


# unpivot the data
unpivoted_df = df.selectExpr("name", "stack(2, 'Jan', Jan, 'Feb', Feb) as (month, revenue)")


# show the unpivoted data
unpivoted_df.show()



+-----+-----+-------+
| name|month|revenue|
+-----+-----+-------+
|Alice|  Jan|   1000|
|Alice|  Feb|   1500|
|  Bob|  Jan|   2000|
|  Bob|  Feb|   2500|
+-----+-----+-------+
