Pyspark: Pass multiple columns in UDF

Anonymous (unverified), submitted on 2019-12-03 01:27:01

Question:

I am writing a user-defined function (UDF) that takes all the columns of a dataframe except the first one and sums them (or applies some other operation). The dataframe may have 3 columns, 4 columns, or more; the number varies.

I know I could hard-code four column names and pass them to the UDF, but in this case the number of columns varies, so how can this be done?

Here are two examples: in the first we have two columns to add, and in the second we have three columns to add.

Answer 1:

If all the columns you want to pass to the UDF have the same data type, you can use array as the input parameter, for example:

>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf, array
>>> sum_cols = udf(lambda arr: sum(arr), IntegerType())
>>> spark.createDataFrame([(101, 1, 16)], ['ID', 'A', 'B']) \
...     .withColumn('Result', sum_cols(array('A', 'B'))).show()
+---+---+---+------+
| ID|  A|  B|Result|
+---+---+---+------+
|101|  1| 16|    17|
+---+---+---+------+

>>> spark.createDataFrame([(101, 1, 16, 8)], ['ID', 'A', 'B', 'C']) \
...     .withColumn('Result', sum_cols(array('A', 'B', 'C'))).show()
+---+---+---+---+------+
| ID|  A|  B|  C|Result|
+---+---+---+---+------+
|101|  1| 16|  8|    25|
+---+---+---+---+------+


Answer 2:

Use struct instead of array:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf, struct

sum_cols = udf(lambda x: x[0] + x[1], IntegerType())
a = spark.createDataFrame([(101, 1, 16)], ['ID', 'A', 'B'])
a.show()
a.withColumn('Result', sum_cols(struct('A', 'B'))).show()

