Pivot and Unpivot in spark
Arihant Shashank
Data & Analytics Architect | Data Engineer, AWS, Snowflake, Machine learning, Visualization | Emerging LLM Engineer ?? Snowflakes Data superhero 2024,2025 ??
How to do pivot and unpivot in pyspark?
Example 1: Pivot
Let's say you have a data frame with the following data:
+-----+-----+-------
|name |month|revenue|
+-----+-----+-------+
|Alice|Jan? |1000? ?|
|Bob? |Jan? |2000? ?|
|Alice|Feb? |1500? ?|
|Bob? |Feb? |2500? ?|
+-----+-----+-------+
And you want to pivot the data by month to get the following result:
+-----+----+----
|name |Jan |Feb |
+-----+----+----+
|Alice|1000|1500|
|Bob |2000|2500|
+-----+----+----++
You can do this using the pivot() function as follows:
from pyspark.sql.functions import
from pyspark.sql import SparkSession
# create SparkSession
spark = SparkSession.builder.appName("pivot_example").getOrCreate()
# create DataFrame
data = [("Alice", "Jan", 1000),
("Bob", "Jan", 2000),
("Alice", "Feb", 1500),
("Bob", "Feb", 2500)]
df = spark.createDataFrame(data, ["name", "month", "revenue"])
# pivot the data
pivoted_df = df.pivot("month").agg(sum("revenue"))
# show the pivoted data
pivoted_df.show()*
+-----+----+----+
| name| Jan| Feb|
+-----+----+----+
|Alice|1000|1500|
|? Bob|2000|2500|
+-----+----+----+
Example 2: Unpivot
Let's say you have a DataFrame with the following data:
+-----+----+----+
| name| Jan| Feb|
+-----+----+----+
|Alice|1000|1500|
| Bob|2000|2500|
+-----+----+----+
and you want the following output
+-----+-----+-------
|name |month|revenue|
+-----+-----+-------+
|Alice|Jan |1000 |
|Alice|Feb |1500 |
|Bob |Jan |2000 |
|Bob |Feb |2500 |
+-----+-----+-------++
You can do this using the stack() function as follows:
from pyspark.sql.functions import
from pyspark.sql import SparkSession
# create SparkSession
spark = SparkSession.builder.appName("unpivot_example").getOrCreate()
# create DataFrame
data = [("Alice", 1000, 1500),
? ? ? ? ("Bob", 2000, 2500)]
df = spark.createDataFrame(data, ["name", "Jan", "Feb"])
# unpivot the data
unpivoted_df = df.selectExpr("name", "stack(2, 'Jan', Jan, 'Feb', Feb) as (month, revenue)")
# show the unpivoted data
unpivoted_df.show()
*
+-----+-----+-------
| name|month|revenue|
+-----+-----+-------+
|Alice| Jan| 1000|
|Alice| Feb| 1500|
| Bob| Jan| 2000|
| Bob| Feb| 2500|
+-----+-----++