Exploding arrays in Spark
When we apply the "explode" function to a DataFrame we focus on a particular column, but that column rarely lives alone: the other columns relate to it, so after the "explosion" we usually want to see the new DataFrame together with all the other columns, or at least some of them.
First, let's create a test DataFrame with an array column:
arrayData = [
    ('James', ['Java', 'Scala'], {'hair': 'black', 'eye': 'brown'}),
    ('Michael', ['Spark', 'Java', None], {'hair': 'brown', 'eye': None}),
    ('Robert', ['CSharp', ''], {'hair': 'red', 'eye': ''}),
    ('Washington', None, None),
    ('Jefferson', ['1', '2'], {})]
df = spark.createDataFrame(data=arrayData, schema=['name', 'knownLanguages', 'properties'])
df.printSchema()
df.show()
Our dataframe is something like this:
+----------+-------------------+-----------------------------+
|name |knownLanguages |properties |
+----------+-------------------+-----------------------------+
|James |[Java, Scala] |{eye -> brown, hair -> black}|
|Michael |[Spark, Java, null]|{eye -> null, hair -> brown} |
|Robert |[CSharp, ] |{eye -> , hair -> red} |
|Washington|null |null |
|Jefferson |[1, 2] |{} |
+----------+-------------------+-----------------------------+
Let's suppose we want to explode the array column "knownLanguages":
from pyspark.sql.functions import explode
df2=df.select(explode(df.knownLanguages).alias("exp_languages"))
df2.printSchema()
df2.show()
Our dataframe output is:
+-------------+
|exp_languages|
+-------------+
| Java|
| Scala|
| Spark|
| Java|
| null|
| CSharp|
| |
| 1|
| 2|
+-------------+
But, as mentioned before, we want to see how the exploded values relate to the other columns.
All columns + explode knownlanguages + drop unwanted columns
A first approach is simply to append the exploded column to the others: adding "*" to the select statement brings every original column into the output DataFrame, and a drop at the end removes any unwanted ones. The alias call just renames the exploded column.
from pyspark.sql.functions import explode
df2 = df.select("*", explode(df.knownLanguages).alias("exp_languages")) \
    .drop("knownLanguages")
df2.printSchema()
df2.show()
In this scenario the output would be like this:
+---------+--------------------+-------------+
| name| properties|exp_languages|
+---------+--------------------+-------------+
| James|{eye -> brown, ha...| Java|
| James|{eye -> brown, ha...| Scala|
| Michael|{eye -> null, hai...| Spark|
| Michael|{eye -> null, hai...| Java|
| Michael|{eye -> null, hai...| null|
| Robert|{eye -> , hair ->...| CSharp|
| Robert|{eye -> , hair ->...| |
|Jefferson| {}| 1|
|Jefferson| {}| 2|
+---------+--------------------+-------------+
Selected columns + explode knownlanguages
If we only want a few columns from the original DataFrame, and don't want to have to drop columns afterwards, we can declare the columns explicitly instead of using "*".
In this example we are selecting name, properties and our new column "exp_languages".
from pyspark.sql.functions import explode
df2 = df.select("name", "properties", explode("knownLanguages").alias("exp_languages"))
df2.printSchema()
df2.show()
Our output dataframe is:
+---------+--------------------+-------------+
| name| properties|exp_languages|
+---------+--------------------+-------------+
| James|{eye -> brown, ha...| Java|
| James|{eye -> brown, ha...| Scala|
| Michael|{eye -> null, hai...| Spark|
| Michael|{eye -> null, hai...| Java|
| Michael|{eye -> null, hai...| null|
| Robert|{eye -> , hair ->...| CSharp|
| Robert|{eye -> , hair ->...| |
|Jefferson| {}| 1|
|Jefferson| {}| 2|
+---------+--------------------+-------------+
The output DataFrame is pretty much the same for both approaches in this simple scenario. But in real-world DataFrames with hundreds of columns it is handy to know these options and how to implement them.
Sometimes we just want to append an exploded column to all the others; in other situations we may want to select just a handful of columns.
That's it! See you soon with more Spark and big data curiosities!
Inspired by this article:
https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/