PySpark: Pass multiple columns to a UDF

有刺的猬 2020-11-30 02:47

I am writing a user-defined function (UDF) that takes all the columns of a DataFrame except the first one and computes their sum (or applies some other operation). Now the DataFrame can sometimes

6 answers
  •  伪装坚强ぢ
    2020-11-30 03:43

    If all the columns you want to pass to the UDF have the same data type, you can wrap them with array and use that as the single input parameter, for example:

    >>> from pyspark.sql.types import IntegerType
    >>> from pyspark.sql.functions import udf, array
    >>> sum_cols = udf(lambda arr: sum(arr), IntegerType())
    >>> spark.createDataFrame([(101, 1, 16)], ['ID', 'A', 'B']) \
    ...     .withColumn('Result', sum_cols(array('A', 'B'))).show()
    +---+---+---+------+
    | ID|  A|  B|Result|
    +---+---+---+------+
    |101|  1| 16|    17|
    +---+---+---+------+
    
    >>> spark.createDataFrame([(101, 1, 16, 8)], ['ID', 'A', 'B', 'C'])\
    ...     .withColumn('Result', sum_cols(array('A', 'B', 'C'))).show()
    +---+---+---+---+------+
    | ID|  A|  B|  C|Result|
    +---+---+---+---+------+
    |101|  1| 16|  8|    25|
    +---+---+---+---+------+
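
The examples above list the columns by hand. Since the question asks for every column except the first, the same array approach can be driven from `df.columns`. The following is only a minimal sketch: the helper names `columns_after_first`, `sum_values`, and `with_row_sum` are hypothetical, `LongType` is an assumption (Python integers are inferred as long by `createDataFrame`), and skipping nulls is a choice made here, not something the original answer does.

```python
def columns_after_first(columns):
    """Return every column name except the first one."""
    return list(columns[1:])


def sum_values(values):
    """Sum a list of numbers, skipping None (Spark nulls arrive as None)."""
    return sum(v for v in values if v is not None)


def with_row_sum(df, result_col="Result"):
    """Append a column holding the sum of all columns except the first.

    Assumes all summed columns share one integer type, as array() requires
    a common element type.
    """
    # Imported here so the two plain-Python helpers above work without Spark.
    from pyspark.sql.functions import array, udf
    from pyspark.sql.types import LongType

    sum_cols = udf(sum_values, LongType())
    # array() accepts column names, so the sliced name list can be unpacked.
    cols = columns_after_first(df.columns)
    return df.withColumn(result_col, sum_cols(array(*cols)))
```

With a session available, `with_row_sum(spark.createDataFrame([(101, 1, 16, 8)], ['ID', 'A', 'B', 'C'])).show()` should behave like the second example above, without naming A, B, and C explicitly.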
    
