Using lambda functions on Spark DataFrames

A lambda function is a small anonymous function. The basic syntax to use it in Python is:

lambda arguments : expression        

So one basic example could be:

x = lambda a: a + 20
print(x(5))

which would return 25 in this example.
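A lambda can also take more than one argument, which is the form we will use on the DataFrame below:

add = lambda a, b: a + b
print(add(10, 11))

This prints 21.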


We can also use lambdas on Spark DataFrames; they are pretty handy when we are creating new columns based on existing columns of the DataFrame.

Let's say we have this very basic DataFrame:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

columns = ["Seqno","Name","x","y"]
data = [("1", "john jones",10,11),
    ("2", "tracey smith",20,21),
    ("3", "amy sanders",30,31)]

df = spark.createDataFrame(data=data, schema=columns)

df.show(truncate=False)
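With truncate=False, the output of df.show should look roughly like this:

+-----+------------+---+---+
|Seqno|Name        |x  |y  |
+-----+------------+---+---+
|1    |john jones  |10 |11 |
|2    |tracey smith|20 |21 |
|3    |amy sanders |30 |31 |
+-----+------------+---+---+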


Let's imagine a scenario where we need to sum columns x and y. In this example we can do it like this using a lambda:


df.withColumn("z", udf(lambda x, y: x + y)("x", "y"))        

The output will be something like this:
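Since z is just x + y, with the data above it should look roughly like this (note that without an explicit return type the UDF produces a string column by default):

+-----+------------+---+---+---+
|Seqno|Name        |x  |y  |z  |
+-----+------------+---+---+---+
|1    |john jones  |10 |11 |21 |
|2    |tracey smith|20 |21 |41 |
|3    |amy sanders |30 |31 |61 |
+-----+------------+---+---+---+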


The physical plan generated with explain(False) is:


== Physical Plan ==
(2) Project [Seqno#46, Name#47, x#48L, y#49L, pythonUDF0#202 AS z#196]
+- BatchEvalPython [<lambda>(x#48L, y#49L)#195], [pythonUDF0#202]
   +- (1) Scan ExistingRDD[Seqno#46,Name#47,x#48L,y#49L]


This approach is a kind of shortcut for the regular one of registering the function first.

Just for comparison, to register the function and apply it to the DataFrame we could do it as follows:

sumxy = udf(lambda x, y: x + y, IntegerType())

df.withColumn("Z", sumxy(col("x"), col("y"))) \
  .show(truncate=False)

The output is generated similarly:
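This time Z comes back as an integer column, but the values displayed are the same:

+-----+------------+---+---+---+
|Seqno|Name        |x  |y  |Z  |
+-----+------------+---+---+---+
|1    |john jones  |10 |11 |21 |
|2    |tracey smith|20 |21 |41 |
|3    |amy sanders |30 |31 |61 |
+-----+------------+---+---+---+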

The physical plan generated once we register the function is the same as when using the lambda directly:

== Physical Plan ==
(2) Project [Seqno#46, Name#47, x#48L, y#49L, pythonUDF0#202 AS z#196]
+- BatchEvalPython [<lambda>(x#48L, y#49L)#195], [pythonUDF0#202]
   +- (1) Scan ExistingRDD[Seqno#46,Name#47,x#48L,y#49L]


Performance is also pretty much the same.
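This makes sense: both plans contain the same BatchEvalPython node, so in either case the rows are shipped to a Python worker to evaluate the lambda. It is worth noting that for a simple sum like this, a native column expression avoids the Python UDF entirely and is usually faster, since Spark can evaluate it without leaving the JVM:

df.withColumn("z", col("x") + col("y")) \
  .show(truncate=False)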
