Exploding arrays in Spark
When we apply the "explode" function to a DataFrame we focus on a particular column, but that column rarely lives alone: the other columns relate to it, so after the "explosion" we usually want to see the new DataFrame together with all the other columns, or at least some of them.
First, let's create a test DataFrame with an array column:
arrayData = [
    ('James', ['Java', 'Scala'], {'hair': 'black', 'eye': 'brown'}),
    ('Michael', ['Spark', 'Java', None], {'hair': 'brown', 'eye': None}),
    ('Robert', ['CSharp', ''], {'hair': 'red', 'eye': ''}),
    ('Washington', None, None),
    ('Jefferson', ['1', '2'], {})]
df = spark.createDataFrame(data=arrayData, schema=['name', 'knownLanguages', 'properties'])
df.printSchema()
df.show()
Our dataframe is something like this:
+----------+-------------------+-----------------------------+
|name |knownLanguages |properties |
+----------+-------------------+-----------------------------+
|James |[Java, Scala] |{eye -> brown, hair -> black}|
|Michael |[Spark, Java, null]|{eye -> null, hair -> brown} |
|Robert |[CSharp, ] |{eye -> , hair -> red} |
|Washington|null |null |
|Jefferson |[1, 2] |{} |
+----------+-------------------+-----------------------------+
Let's suppose we want to explode the array column "knownLanguages":
from pyspark.sql.functions import explode
df2=df.select(explode(df.knownLanguages).alias("exp_languages"))
df2.printSchema()
df2.show()
Our dataframe output is:
+-------------+
|exp_languages|
+-------------+
| Java|
| Scala|
| Spark|
| Java|
| null|
| CSharp|
| |
| 1|
| 2|
+-------------+
But, as mentioned before, we want to see how the exploded values relate to the other columns.
All columns + explode knownlanguages + drop unwanted columns
A first approach is simply to append the exploded column to the others: adding "*" to the select statement brings every original column into the output DataFrame, and a drop at the end removes any unwanted ones. The alias call just renames the exploded column.
from pyspark.sql.functions import explode
df2 = df.select("*", explode(df.knownLanguages).alias("exp_languages")) \
    .drop("knownLanguages")
df2.printSchema()
df2.show()
In this scenario the output would be like this:
+---------+--------------------+-------------+
| name| properties|exp_languages|
+---------+--------------------+-------------+
| James|{eye -> brown, ha...| Java|
| James|{eye -> brown, ha...| Scala|
| Michael|{eye -> null, hai...| Spark|
| Michael|{eye -> null, hai...| Java|
| Michael|{eye -> null, hai...| null|
| Robert|{eye -> , hair ->...| CSharp|
| Robert|{eye -> , hair ->...| |
|Jefferson| {}| 1|
|Jefferson| {}| 2|
+---------+--------------------+-------------+
Selected columns + explode knownlanguages
If we only want a few columns from the original DataFrame, and don't want to have to drop columns afterwards, we can declare the columns explicitly instead of using "*".
In this example we are selecting name, properties and our new column "exp_languages".
from pyspark.sql.functions import explode
df2 = df.select("name", "properties", explode("knownLanguages").alias("exp_languages"))
df2.printSchema()
df2.show()
Our output dataframe is:
+---------+--------------------+-------------+
| name| properties|exp_languages|
+---------+--------------------+-------------+
| James|{eye -> brown, ha...| Java|
| James|{eye -> brown, ha...| Scala|
| Michael|{eye -> null, hai...| Spark|
| Michael|{eye -> null, hai...| Java|
| Michael|{eye -> null, hai...| null|
| Robert|{eye -> , hair ->...| CSharp|
| Robert|{eye -> , hair ->...| |
|Jefferson| {}| 1|
|Jefferson| {}| 2|
+---------+--------------------+-------------+
The output DataFrame is pretty much the same for both approaches in this simple scenario. But in real-world DataFrames with hundreds of columns it is handy to know these options and how to implement them.
Sometimes we just want to append an exploded column to all the others; in other situations we may want to select just a handful of columns.
That's it! See you soon with more Spark and big data curiosities!
Inspired by this article:
https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/