Pyspark: Pass multiple columns in UDF

前端未结

关注

 6  639

有刺的猬 2020-11-30 02:47

I am writing a User Defined Function which will take all the columns except the first one in a dataframe and do sum (or any other operation). Now the dataframe can sometimes

6条回答

猫巷女王i (楼主)

2020-11-30 03:39

Maybe it's a late answer, but I don't like using UDFs without necessity, so:

from pyspark.sql.functions import col
from functools import reduce
data = [["a",1,2,5],["b",2,3,7],["c",3,4,8]]
df = spark.createDataFrame(data,["id","v1","v2",'v3'])

calculate = reduce(lambda a, x: a+x, map(col, ["v1","v2",'v3']))

df.withColumn("Result", calculate)
#
#id v1  v2  v3  Result
#a  1   2   5   8
#b  2   3   7   12
#c  3   4   8   15

Here u could to use any operation which implement in Column. Also if u want to write a custom udf with specific logic, u could use it, because Column provide tree execution operations. Without collecting to array and sum on it.

If compared with process as array operations, it will be bad from performance perspective, let's take a look at the physical plan, in my case and array case, in my case and array cased.

my case:

== Physical Plan ==
*(1) Project [id#355, v1#356L, v2#357L, v3#358L, ((v1#356L + v2#357L) + v3#358L) AS Result#363L]
+- *(1) Scan ExistingRDD[id#355,v1#356L,v2#357L,v3#358L]

array case:

== Physical Plan ==
*(2) Project [id#339, v1#340L, v2#341L, v3#342L, pythonUDF0#354 AS Result#348]
+- BatchEvalPython [(array(v1#340L, v2#341L, v3#342L))], [pythonUDF0#354]
   +- *(1) Scan ExistingRDD[id#339,v1#340L,v2#341L,v3#342L]

When possible - we need to avoid using UDFs as Catalyst does not know how to optimize those

0 讨论(0)

查看其它6个回答