Pivot and Unpivot in Spark

How to do pivot and unpivot in PySpark?

Example 1: Pivot

Let's say you have a DataFrame with the following data:


+-----+-----+-------+
|name |month|revenue|
+-----+-----+-------+
|Alice|Jan  |1000   |
|Bob  |Jan  |2000   |
|Alice|Feb  |1500   |
|Bob  |Feb  |2500   |
+-----+-----+-------+

And you want to pivot the data by month to get the following result:


+-----+----+----+
|name |Jan |Feb |
+-----+----+----+
|Alice|1000|1500|
|Bob  |2000|2500|
+-----+----+----+

You can do this using the pivot() function on a grouped DataFrame as follows:


from pyspark.sql.functions import sum
from pyspark.sql import SparkSession

# create SparkSession
spark = SparkSession.builder.appName("pivot_example").getOrCreate()

# create DataFrame
data = [("Alice", "Jan", 1000),
        ("Bob", "Jan", 2000),
        ("Alice", "Feb", 1500),
        ("Bob", "Feb", 2500)]
df = spark.createDataFrame(data, ["name", "month", "revenue"])

# pivot the data: pivot() is called on grouped data, so group by name first
pivoted_df = df.groupBy("name").pivot("month").agg(sum("revenue"))

# show the pivoted data
pivoted_df.show()

+-----+----+----+
| name| Jan| Feb|
+-----+----+----+
|Alice|1000|1500|
|  Bob|2000|2500|
+-----+----+----+
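To see what the groupBy/pivot/sum chain is doing conceptually, here is a plain-Python sketch (no Spark required) that mirrors the same logic on the same data; the variable names are illustrative only:

```python
# The same rows as the DataFrame above
rows = [("Alice", "Jan", 1000),
        ("Bob", "Jan", 2000),
        ("Alice", "Feb", 1500),
        ("Bob", "Feb", 2500)]

# Group by name and spread each month into its own "column" (dict key),
# accumulating revenue like agg(sum(...)) would if a (name, month) pair repeated.
pivoted = {}
for name, month, revenue in rows:
    pivoted.setdefault(name, {})[month] = pivoted.get(name, {}).get(month, 0) + revenue

print(pivoted)
# {'Alice': {'Jan': 1000, 'Feb': 1500}, 'Bob': {'Jan': 2000, 'Feb': 2500}}
```

Each outer key corresponds to one output row, and each inner key becomes one pivoted column, which is exactly the shape of the table above.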


        

Example 2: Unpivot

Let's say you have a DataFrame with the following data:


+-----+----+----+
| name| Jan| Feb|
+-----+----+----+
|Alice|1000|1500|
|  Bob|2000|2500|
+-----+----+----+
        

And you want to unpivot it back into the following result:

+-----+-----+-------+
|name |month|revenue|
+-----+-----+-------+
|Alice|Jan  |1000   |
|Alice|Feb  |1500   |
|Bob  |Jan  |2000   |
|Bob  |Feb  |2500   |
+-----+-----+-------+

You can do this using the stack() SQL function inside selectExpr() as follows:


from pyspark.sql import SparkSession


# create SparkSession
spark = SparkSession.builder.appName("unpivot_example").getOrCreate()


# create DataFrame
data = [("Alice", 1000, 1500),
        ("Bob", 2000, 2500)]
df = spark.createDataFrame(data, ["name", "Jan", "Feb"])


# unpivot the data
unpivoted_df = df.selectExpr("name", "stack(2, 'Jan', Jan, 'Feb', Feb) as (month, revenue)")


# show the unpivoted data
unpivoted_df.show()



+-----+-----+-------+
| name|month|revenue|
+-----+-----+-------+
|Alice|  Jan|   1000|
|Alice|  Feb|   1500|
|  Bob|  Jan|   2000|
|  Bob|  Feb|   2500|
+-----+-----+-------+
