Add column sum as new column in PySpark dataframe

front-end · unresolved · 8 answers · 2060 views
粉色の甜心 2020-12-02 22:43

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.

Suppose my dataframe…

8 Answers
  •  甜味超标
    2020-12-02 23:06

    The solution

    newdf = df.withColumn('total', sum(df[col] for col in df.columns))
    

    posted by @Paul works. Nevertheless, I was getting the error that many others have reported:

    TypeError: 'Column' object is not callable
    

    After some time I found the cause (at least in my case): I had previously imported some PySpark functions with the line

    from pyspark.sql.functions import udf, col, count, sum, when, avg, mean, min
    

    so that import shadowed Python's built-in sum with the PySpark sum function, while df.withColumn('total', sum(df[col] for col in df.columns)) is supposed to use the plain Python sum.

    You can remove the shadowing reference to the PySpark function with del sum.
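    The shadowing behavior can be reproduced without Spark at all. A minimal sketch (the sum defined below is a stand-in for pyspark.sql.functions.sum, not the real API):

    ```python
    import builtins

    def sum(cols):
        # stand-in for pyspark.sql.functions.sum; this shadows the built-in
        return "Column-like aggregate"

    shadowed = sum([1, 2, 3])   # resolves to the stand-in, not the built-in

    del sum                     # drop the shadowing name...
    restored = sum([1, 2, 3])   # ...and name lookup falls through to the built-in

    # builtins.sum is always reachable, shadowed or not
    always_ok = builtins.sum([1, 2, 3])
    ```

    This is why del sum fixes the TypeError: deleting the module-level name lets lookups fall through to the built-in again.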

    Alternatively, in my case I changed the import to

    import pyspark.sql.functions as F
    

    and then referenced the functions as F.sum.
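    If you prefer to sidestep the name sum entirely, another common pattern (a sketch, not part of the original answer) is functools.reduce with operator.add; it works on Spark Column objects because they overload +. The same reduction on plain Python numbers:

    ```python
    from functools import reduce
    from operator import add

    # In PySpark this would be (assuming df is a DataFrame of numeric columns):
    #   newdf = df.withColumn('total', reduce(add, [df[c] for c in df.columns]))

    # Plain-Python analogue of the same fold:
    total = reduce(add, [1, 2, 3, 4])
    print(total)
    ```

    Since reduce never touches the name sum, it is immune to this particular import shadowing.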
