Pyspark: Pass multiple columns in UDF

Anonymous (unverified), submitted on 2019-12-03 01:27:01

Question:

I am writing a user-defined function (UDF) that takes all the columns of a dataframe except the first one and sums them (or applies some other operation). The dataframe may have 3 columns, 4 columns, or more; the number varies.

I know I could hard-code four column names and pass them to the UDF, but in this case the number of columns varies, so how can this be done?

Here are two examples: in the first we have two columns to add, and in the second we have three columns to add.

Answer 1:

If all the columns you want to pass to the UDF have the same data type, you can use array as the input parameter, for example:

>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf, array
>>> sum_cols = udf(lambda arr: sum(arr), IntegerType())
>>> spark.createDataFrame([(101, 1, 16)], ['ID', 'A', 'B']) \
...     .withColumn('Result', sum_cols(array('A', 'B'))).show()
+---+---+---+------+
| ID|  A|  B|Result|
+---+---+---+------+
|101|  1| 16|    17|
+---+---+---+------+

>>> spark.createDataFrame([(101, 1, 16, 8)], ['ID', 'A', 'B', 'C']) \
...     .withColumn('Result', sum_cols(array('A', 'B', 'C'))).show()
+---+---+---+---+------+
| ID|  A|  B|  C|Result|
+---+---+---+---+------+
|101|  1| 16|  8|    25|
+---+---+---+---+------+


Answer 2:

Use struct instead of array:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf, struct

sum_cols = udf(lambda x: x[0] + x[1], IntegerType())
a = spark.createDataFrame([(101, 1, 16)], ['ID', 'A', 'B'])
a.show()
a.withColumn('Result', sum_cols(struct('A', 'B'))).show()

