PySpark: Pass multiple columns to a UDF

有刺的猬 2020-11-30 02:47

I am writing a user-defined function (UDF) that takes all the columns of a DataFrame except the first one and computes their sum (or any other operation). Now the dataframe can sometimes

6 answers
  •  無奈伤痛
    2020-11-30 03:53

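    Use struct instead of array to pass the columns to the UDF. The UDF then receives them as a single Row whose fields can be read by position.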

    from pyspark.sql.functions import struct, udf
    from pyspark.sql.types import IntegerType
    # The UDF gets the struct as one Row; x[0] is column A, x[1] is column B.
    sum_cols = udf(lambda x: x[0] + x[1], IntegerType())
    # Assumes an existing SparkSession named `spark`.
    a = spark.createDataFrame([(101, 1, 16)], ['ID', 'A', 'B'])
    a.show()
    a.withColumn('Result', sum_cols(struct('A', 'B'))).show()
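The snippet above hard-codes two columns. For the "all columns except the first" case from the question, a minimal sketch could build the struct from `df.columns[1:]` instead. The names `sum_fields` and `with_row_sum` are made up for illustration, and the sketch assumes all value columns are integers and that nulls should simply be skipped:

```python
def sum_fields(row):
    """Sum the fields of a struct, which a Python UDF receives as a Row (a tuple subclass).

    None values are skipped (an assumption, not something the question specifies).
    """
    return sum(v for v in row if v is not None)


def with_row_sum(df, result_col="Result"):
    """Return df plus a column holding the sum of every column except the first."""
    # Imported here so sum_fields can be tried out without a Spark installation.
    from pyspark.sql.functions import struct, udf
    from pyspark.sql.types import LongType

    # LongType is assumed because Python ints are inferred as longs by createDataFrame.
    sum_udf = udf(sum_fields, LongType())
    value_cols = df.columns[1:]  # skip the first column (e.g. an ID)
    return df.withColumn(result_col, sum_udf(struct(*value_cols)))
```

With the DataFrame `a` from the answer, `with_row_sum(a).show()` should add a `Result` column for however many columns follow `ID`, so the UDF does not need to change when the number of columns does.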
    
