Exploding arrays in Spark

When we apply the "explode" function to a DataFrame we focus on a particular column, but the DataFrame almost always has other columns that relate to it. So after the "explosion" we want to see the new DataFrame together with all the other columns, or maybe just some of them.

First, let's create a test DataFrame with an array column and a map column:

arrayData = [
    ('James', ['Java', 'Scala'], {'hair': 'black', 'eye': 'brown'}),
    ('Michael', ['Spark', 'Java', None], {'hair': 'brown', 'eye': None}),
    ('Robert', ['CSharp', ''], {'hair': 'red', 'eye': ''}),
    ('Washington', None, None),
    ('Jefferson', ['1', '2'], {})]


df = spark.createDataFrame(data=arrayData, schema = ['name','knownLanguages','properties'])
df.printSchema()
df.show()

        

Our DataFrame looks like this:

+----------+-------------------+-----------------------------+
|name      |knownLanguages     |properties                   |
+----------+-------------------+-----------------------------+
|James     |[Java, Scala]      |{eye -> brown, hair -> black}|
|Michael   |[Spark, Java, null]|{eye -> null, hair -> brown} |
|Robert    |[CSharp, ]         |{eye -> , hair -> red}       |
|Washington|null               |null                         |
|Jefferson |[1, 2]             |{}                           |
+----------+-------------------+-----------------------------+
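Before going further, it helps to keep in mind what explode does conceptually: each row produces one output row per element of its array, and rows whose array is null produce no output rows at all. A pure-Python sketch of those semantics (just an illustration, not Spark code):

```python
# Conceptual model of Spark's explode: one output row per array
# element; rows whose array is None are dropped entirely.
def explode_rows(rows, array_key):
    out = []
    for row in rows:
        values = row[array_key]
        if values is None:
            continue  # explode drops rows with a null array
        for v in values:
            new_row = dict(row)
            new_row[array_key] = v
            out.append(new_row)
    return out

rows = [
    {'name': 'James', 'knownLanguages': ['Java', 'Scala']},
    {'name': 'Washington', 'knownLanguages': None},
]
for r in explode_rows(rows, 'knownLanguages'):
    print(r)
# James appears once per language; Washington produces no rows.
```

Keep that null-dropping behavior in mind; it shows up in the outputs below.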

Let's suppose we want to explode the array column "knownLanguages".

from pyspark.sql.functions import explode

df2 = df.select(explode(df.knownLanguages).alias("exp_languages"))
df2.printSchema()
df2.show()        

Our dataframe output is:

+-------------+
|exp_languages|
+-------------+
|         Java|
|        Scala|
|        Spark|
|         Java|
|         null|
|       CSharp|
|             |
|            1|
|            2|
+-------------+

But, as mentioned before, we want to see how the exploded values relate to the other columns.


All columns + explode knownLanguages + drop unwanted columns

A first approach is to append the exploded column to all the others: adding "*" to the select statement brings every original column into the output DataFrame, and a drop at the end removes the columns we no longer want. The alias function simply renames the exploded column.

from pyspark.sql.functions import explode

df2 = df.select("*", explode(df.knownLanguages).alias("exp_languages")) \
        .drop("knownLanguages")
df2.printSchema()
df2.show()
        

In this scenario the output would be like this:

+---------+--------------------+-------------+
|     name|          properties|exp_languages|
+---------+--------------------+-------------+
|    James|{eye -> brown, ha...|         Java|
|    James|{eye -> brown, ha...|        Scala|
|  Michael|{eye -> null, hai...|        Spark|
|  Michael|{eye -> null, hai...|         Java|
|  Michael|{eye -> null, hai...|         null|
|   Robert|{eye -> , hair ->...|       CSharp|
|   Robert|{eye -> , hair ->...|             |
|Jefferson|                  {}|            1|
|Jefferson|                  {}|            2|
+---------+--------------------+-------------+


Selected columns + explode knownLanguages

If we want to select only a few columns from the original DataFrame, without having to drop the rest, we can declare those columns instead of using "*".

In this example we select name, properties, and our new column "exp_languages".

from pyspark.sql.functions import explode

df2 = df.select("name", "properties", explode("knownLanguages").alias("exp_languages"))
df2.printSchema()
df2.show()        

Our output dataframe is:


+---------+--------------------+-------------+
|     name|          properties|exp_languages|
+---------+--------------------+-------------+
|    James|{eye -> brown, ha...|         Java|
|    James|{eye -> brown, ha...|        Scala|
|  Michael|{eye -> null, hai...|        Spark|
|  Michael|{eye -> null, hai...|         Java|
|  Michael|{eye -> null, hai...|         null|
|   Robert|{eye -> , hair ->...|       CSharp|
|   Robert|{eye -> , hair ->...|             |
|Jefferson|                  {}|            1|
|Jefferson|                  {}|            2|
+---------+--------------------+-------------+


The output DataFrame is pretty much the same for both approaches in this simple scenario. But in real-world DataFrames with hundreds of columns it can be handy to know these options and how to implement them.


Sometimes we just want to append an exploded column to all the others, and in other situations we may want to select just a handful of columns.

That's it! See you soon with more Spark and big data curiosities!

Inspired by this article:

https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/
