Using lambda functions on Spark DataFrames

A lambda function is a small anonymous function. The basic syntax to use it in Python is:

lambda arguments : expression        

So one basic example could be:

x = lambda a: a + 20
print(x(5))

which would return 25 in this example.
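A lambda can also take more than one argument, which is the form we will use on the DataFrame below:

add = lambda a, b: a + b
print(add(10, 11))

This prints 21.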


We can also use lambdas on Spark DataFrames; they are pretty handy when we are creating new columns based on existing columns of the DataFrame.

Let's say we have this very basic DataFrame:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

columns = ["Seqno","Name","x","y"]
data = [("1", "john jones",10,11),
    ("2", "tracey smith",20,21),
    ("3", "amy sanders",30,31)]

df = spark.createDataFrame(data=data, schema=columns)

df.show(truncate=False)
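With truncate=False, the output of df.show should look roughly like this:

+-----+------------+---+---+
|Seqno|Name        |x  |y  |
+-----+------------+---+---+
|1    |john jones  |10 |11 |
|2    |tracey smith|20 |21 |
|3    |amy sanders |30 |31 |
+-----+------------+---+---+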


Let's imagine a scenario where we need to sum columns x and y. In this example we can do it like this using a lambda:


df.withColumn("z", udf(lambda x, y: x + y)("x", "y"))        

The output will be something like this:
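Since z is just x + y, with the data above it should look roughly like this (note that without an explicit return type the UDF produces a string column by default):

+-----+------------+---+---+---+
|Seqno|Name        |x  |y  |z  |
+-----+------------+---+---+---+
|1    |john jones  |10 |11 |21 |
|2    |tracey smith|20 |21 |41 |
|3    |amy sanders |30 |31 |61 |
+-----+------------+---+---+---+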


The physical plan generated with explain(False) is:


== Physical Plan ==
(2) Project [Seqno#46, Name#47, x#48L, y#49L, pythonUDF0#202 AS z#196]
+- BatchEvalPython [<lambda>(x#48L, y#49L)#195], [pythonUDF0#202]
   +- (1) Scan ExistingRDD[Seqno#46,Name#47,x#48L,y#49L]


This approach is a kind of shortcut for the regular one of registering the function first.

Just for comparison, to register the function and apply it to the DataFrame we could do it as follows:

sumxy = udf(lambda x, y: x + y, IntegerType())

df.withColumn("Z", sumxy(col("x"), col("y"))) \
  .show(truncate=False)

The output is generated similarly:
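This time Z comes back as an integer column, but the values displayed are the same:

+-----+------------+---+---+---+
|Seqno|Name        |x  |y  |Z  |
+-----+------------+---+---+---+
|1    |john jones  |10 |11 |21 |
|2    |tracey smith|20 |21 |41 |
|3    |amy sanders |30 |31 |61 |
+-----+------------+---+---+---+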

The physical plan generated once we register the function is the same as when using the lambda directly:

== Physical Plan ==
(2) Project [Seqno#46, Name#47, x#48L, y#49L, pythonUDF0#202 AS z#196]
+- BatchEvalPython [<lambda>(x#48L, y#49L)#195], [pythonUDF0#202]
   +- (1) Scan ExistingRDD[Seqno#46,Name#47,x#48L,y#49L]


Performance is also pretty much the same.
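This makes sense: both plans contain the same BatchEvalPython node, so in either case the rows are shipped to a Python worker to evaluate the lambda. It is worth noting that for a simple sum like this, a native column expression avoids the Python UDF entirely and is usually faster, since Spark can evaluate it without leaving the JVM:

df.withColumn("z", col("x") + col("y")) \
  .show(truncate=False)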
