Using lambda functions on Spark DataFrames
A lambda function is a small anonymous function. The basic syntax in Python is:
lambda arguments : expression
So a basic example could be:
x = lambda a : a + 20
print(x(5))
which prints 25 in this example.
We can also use lambdas on Spark DataFrames; they are pretty handy when we are creating new columns based on existing columns of the DataFrame.
Let's say we have this very basic DataFrame:
columns = ["Seqno", "Name", "x", "y"]
data = [("1", "john jones", 10, 11),
        ("2", "tracey smith", 20, 21),
        ("3", "amy sanders", 30, 31)]
df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)
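With the data above, this prints:
+-----+------------+--+--+
|Seqno|Name        |x |y |
+-----+------------+--+--+
|1    |john jones  |10|11|
|2    |tracey smith|20|21|
|3    |amy sanders |30|31|
+-----+------------+--+--+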
Let's imagine a scenario where we need to sum columns x and y. We can do it like this using a lambda:
df.withColumn("z", udf(lambda x, y: x + y)("x", "y"))
The output will be something like this:
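+-----+------------+--+--+--+
|Seqno|Name        |x |y |z |
+-----+------------+--+--+--+
|1    |john jones  |10|11|21|
|2    |tracey smith|20|21|41|
|3    |amy sanders |30|31|61|
+-----+------------+--+--+--+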
The physical plan generated with explain(False) is:
== Physical Plan ==
*(2) Project [Seqno#46, Name#47, x#48L, y#49L, pythonUDF0#202 AS z#196]
+- BatchEvalPython [<lambda>(x#48L, y#49L)#195], [pythonUDF0#202]
   +- *(1) Scan ExistingRDD[Seqno#46,Name#47,x#48L,y#49L]
This approach is a shortcut for the more usual one of registering the function first.
Just for comparison, to register the function and apply it to the DataFrame, we could do it as follows:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
sumxy = udf(lambda x, y: x + y, IntegerType())
df.withColumn("Z", sumxy(col("x"), col("y"))) \
  .show(truncate=False)
The output is generated similarly:
The physical plan generated once we register the function is the same as when using the lambda directly:
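+-----+------------+--+--+--+
|Seqno|Name        |x |y |Z |
+-----+------------+--+--+--+
|1    |john jones  |10|11|21|
|2    |tracey smith|20|21|41|
|3    |amy sanders |30|31|61|
+-----+------------+--+--+--+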
== Physical Plan ==
*(2) Project [Seqno#46, Name#47, x#48L, y#49L, pythonUDF0#202 AS z#196]
+- BatchEvalPython [<lambda>(x#48L, y#49L)#195], [pythonUDF0#202]
   +- *(1) Scan ExistingRDD[Seqno#46,Name#47,x#48L,y#49L]
Performance is also pretty much the same, which makes sense given the identical plans: both versions run the lambda as a Python UDF through a BatchEvalPython step.
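As a side note, for a simple sum like this one, a plain column expression avoids the Python UDF entirely; a minimal sketch:
from pyspark.sql.functions import col
# Native column arithmetic stays in the JVM, so no BatchEvalPython step appears in the plan
df.withColumn("z", col("x") + col("y")).show(truncate=False)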