Add column sum as new column in PySpark dataframe

粉色の甜心 2020-12-02 22:43

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.

Suppose my dataframe has columns "a", "b", and "c".

8 Answers
  • 2020-12-02 23:06

    The solution

    newdf = df.withColumn('total', sum(df[col] for col in df.columns))
    

    posted by @Paul works. Nevertheless, I was getting the error that many others have also seen:

    TypeError: 'Column' object is not callable
    

    After some time I found the problem (at least in my case): I had previously imported some PySpark functions with the line

    from pyspark.sql.functions import udf, col, count, sum, when, avg, mean, min
    

    so that line imported PySpark's sum function, while df.withColumn('total', sum(df[col] for col in df.columns)) is supposed to use the built-in Python sum function.

    You can delete the reference to the PySpark function with del sum.

    Alternatively, in my case I changed the import to

    import pyspark.sql.functions as F
    

    and then referenced the functions as F.sum.
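
    For illustration, a minimal sketch of this namespaced approach; the SparkSession setup and the x1/x2/x3 column names are placeholders, not taken from the question:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F  # namespaced import, so the built-in sum() is not shadowed

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ['x1', 'x2', 'x3'])  # placeholder data

    # The built-in sum() folds the Column objects with +, producing one Column expression
    newdf = df.withColumn('total', sum(df[c] for c in df.columns))

    # When the aggregate function is wanted instead, call it explicitly as F.sum
    newdf.agg(F.sum('total')).show()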

  • 2020-12-02 23:11

    A very simple approach would be to just use select instead of withColumn, as below:

    from pyspark.sql.functions import col

    df = df.select('*', (col("a") + col("b") + col("c")).alias("total"))

    This should give you the required sum, with minor changes based on your requirements.
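
    If the sum should cover every column rather than a fixed list, the same select pattern can be generalized; a sketch, assuming all columns of df are numeric:

    from functools import reduce
    from operator import add
    from pyspark.sql.functions import col

    # Fold all columns with + instead of writing them out by hand
    df = df.select('*', reduce(add, [col(c) for c in df.columns]).alias('total'))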

  • 2020-12-02 23:11

    The following approach works for me:

    1. Import the PySpark SQL functions:
       from pyspark.sql import functions as F
    2. Use F.expr() with an expression string that adds the columns:
       data_frame.withColumn('Total_Sum', F.expr('col_name1 + col_name2 + .. + col_namen'))
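
    A sketch of the same idea that builds the expression string from the dataframe's own columns instead of typing the names out (assuming every column of data_frame is numeric and plainly named):

    from pyspark.sql import functions as F

    # Join every column name into a single SQL expression, e.g. 'x1 + x2 + x3'
    sum_expr = ' + '.join(data_frame.columns)
    data_frame = data_frame.withColumn('Total_Sum', F.expr(sum_expr))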
  • 2020-12-02 23:13

    My problem was similar to the above (a bit more complex), as I had to add consecutive column sums as new columns in a PySpark dataframe. This approach uses code from Paul's Version 1 above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('addColAsCumulativeSUM').getOrCreate()
    df = spark.createDataFrame(data=[(1, 2, 3), (4, 5, 6), (3, 2, 1),
                                     (6, 1, -4), (0, 2, -2), (6, 4, 1),
                                     (4, 5, 2), (5, -3, -5), (6, 4, -1)],
                               schema=['x1', 'x2', 'x3'])
    df.show()
    
    +---+---+---+
    | x1| x2| x3|
    +---+---+---+
    |  1|  2|  3|
    |  4|  5|  6|
    |  3|  2|  1|
    |  6|  1| -4|
    |  0|  2| -2|
    |  6|  4|  1|
    |  4|  5|  2|
    |  5| -3| -5|
    |  6|  4| -1|
    +---+---+---+
    
    colnames=df.columns
    

    Add new columns that are cumulative sums over consecutive columns:

    for i in range(0, len(colnames)):
        colnameLst = colnames[0:i+1]   # the first i+1 columns included in this sum
        colname = 'cm' + str(i+1)      # name of the new column, e.g. 'cm2'
        # the built-in Python sum() folds the Column objects with +
        df = df.withColumn(colname, sum(df[col] for col in colnameLst))
    

    df.show()

    +---+---+---+---+---+---+
    | x1| x2| x3|cm1|cm2|cm3|
    +---+---+---+---+---+---+
    |  1|  2|  3|  1|  3|  6|
    |  4|  5|  6|  4|  9| 15|
    |  3|  2|  1|  3|  5|  6|
    |  6|  1| -4|  6|  7|  3|
    |  0|  2| -2|  0|  2|  0|
    |  6|  4|  1|  6| 10| 11|
    |  4|  5|  2|  4|  9| 11|
    |  5| -3| -5|  5|  2| -3|
    |  6|  4| -1|  6| 10|  9|
    +---+---+---+---+---+---+
    

    The 'cumulative sum' columns added are as follows:

    cm1 = x1
    cm2 = x1 + x2
    cm3 = x1 + x2 + x3
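
    The same cumulative columns can also be added in one select instead of a loop; a sketch reusing the colnames list from above:

    # Build one Column expression per prefix of the column list, then add them all at once
    cumulative_cols = [
        sum(df[c] for c in colnames[:i+1]).alias('cm' + str(i+1))
        for i in range(len(colnames))
    ]
    df = df.select('*', *cumulative_cols)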
    
  • 2020-12-02 23:14
    # Build a small example dataframe with two string columns and one numeric column
    df = spark.createDataFrame([("linha1", "valor1", 2), ("linha2", "valor2", 5)], ("Columna1", "Columna2", "Columna3"))
    
    df.show()
    
    +--------+--------+--------+
    |Columna1|Columna2|Columna3|
    +--------+--------+--------+
    |  linha1|  valor1|       2|
    |  linha2|  valor2|       5|
    +--------+--------+--------+
    
    # df[2] selects the third column (Columna3) by position; divide it by two
    df = df.withColumn('DivisaoPorDois', df[2]/2)
    df.show()
    
    +--------+--------+--------+--------------+
    |Columna1|Columna2|Columna3|DivisaoPorDois|
    +--------+--------+--------+--------------+
    |  linha1|  valor1|       2|           1.0|
    |  linha2|  valor2|       5|           2.5|
    +--------+--------+--------+--------------+
    
    # Sum the two numeric columns (Columna3 and DivisaoPorDois), again addressed by position
    df = df.withColumn('Soma_Colunas', df[2]+df[3])
    df.show()
    
    +--------+--------+--------+--------------+------------+
    |Columna1|Columna2|Columna3|DivisaoPorDois|Soma_Colunas|
    +--------+--------+--------+--------------+------------+
    |  linha1|  valor1|       2|           1.0|         3.0|
    |  linha2|  valor2|       5|           2.5|         7.5|
    +--------+--------+--------+--------------+------------+
    
  • 2020-12-02 23:18

    The most straightforward way of doing it is to use the expr function:

    from pyspark.sql.functions import *
    data = data.withColumn('total', expr("col1 + col2 + col3 + col4"))
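
    An equivalent one-liner uses selectExpr, which accepts SQL expression strings directly (the column names are the same placeholders as above):

    data = data.selectExpr('*', 'col1 + col2 + col3 + col4 AS total')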
    