Pyspark: Pass multiple columns in UDF

有刺的猬 2020-11-30 02:47

I am writing a user-defined function (UDF) that takes every column of a dataframe except the first and computes their sum (or some other operation). Now the dataframe can sometimes

6 answers
  •  慢半拍i (original poster)
    2020-11-30 03:51

    If you don't want to type out all your column names and would rather just dump all the columns into your UDF, you'll need to wrap a list comprehension within a struct.

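    from pyspark.sql.functions import struct, udf
    from pyspark.sql.types import DoubleType
    # x is a Row holding every column; x[1:] skips the first. udf() defaults to StringType, so declare a numeric return type.
    sum_udf = udf(lambda x: float(sum(x[1:])), DoubleType())
    df_sum = df.withColumn("result", sum_udf(struct([df[c] for c in df.columns])))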
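
    The row-level logic can also be written as a plain Python function, which makes it easy to check without a Spark session. This is only a minimal sketch: the name sum_all_but_first and the decision to skip null (None) values are assumptions made for illustration, not part of the snippet above.

```python
def sum_all_but_first(row):
    """Sum every field of a row except the first, skipping None (null) values."""
    values = [v for v in row[1:] if v is not None]
    # Return None when nothing is left to sum, so Spark would store a null.
    return float(sum(values)) if values else None


# A pyspark.sql.Row behaves like a tuple, so plain tuples stand in for rows here.
print(sum_all_but_first(("a", 1, 2, 3)))     # 6.0
print(sum_all_but_first(("b", 4, None, 5)))  # 9.0
print(sum_all_but_first(("c",)))             # None
```

    To apply it to a dataframe, it could be wrapped as udf(sum_all_but_first, DoubleType()) and called on struct(*df.columns), assuming all columns after the first are numeric.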
    
