Add column sum as new column in PySpark dataframe


I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.

Suppose my datafram


8 answers

甜味超标

2020-12-02 23:06
The solution
```
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
```
posted by @Paul works. Nevertheless, like many others I have seen, I was getting the error
```
TypeError: 'Column' object is not callable
```
After some time I found the problem (at least in my case). The problem is that I previously imported some pyspark functions with the line
```
from pyspark.sql.functions import udf, col, count, sum, when, avg, mean, min
```
That line imports the PySpark sum function, which shadows Python's built-in sum, whereas df.withColumn('total', sum(df[col] for col in df.columns)) is supposed to use the built-in Python sum function.

You can remove the reference to the PySpark function with del sum, after which the name sum refers to the built-in again.

Alternatively, as I did in my case, change the import to
```
import pyspark.sql.functions as F
```
and then referenced the functions as F.sum.
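
Another way to sidestep the name clash is to avoid the name sum altogether. The sketch below is an alternative to the approach in this answer, not part of it: a hypothetical helper, add_all, folds a sequence with operator.add, so it behaves the same whether or not sum has been rebound by an import. It is checked here with plain integers; the PySpark usage in the trailing comment is an assumption that relies on Column objects supporting the + operator.

```python
from functools import reduce
from operator import add


def add_all(items):
    """Fold `items` left to right with `+` (hypothetical helper name).

    Unlike sum(), this never looks up the name `sum`, so a
    `from pyspark.sql.functions import sum` elsewhere cannot affect it.
    """
    items = list(items)
    if not items:
        raise ValueError("add_all() needs at least one item")
    return reduce(add, items)


# Plain-Python check of the folding logic.
print(add_all([1, 2, 3]))

# Assumed PySpark usage (requires an existing DataFrame `df`):
# newdf = df.withColumn('total', add_all(df[c] for c in df.columns))
```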

庸人自扰

2020-12-02 23:11

A very simple approach is to use select instead of withColumn, as below:

from pyspark.sql.functions import col
df = df.select('*', (col("a") + col("b") + col("c")).alias("total"))

This adds the required sum as a column named total; adjust the column names to match your dataframe.

伪装坚强ぢ

2020-12-02 23:11
The following approach works for me:
1. Import the PySpark SQL functions:
  from pyspark.sql import functions as F
2. Pass F.expr a SQL expression string that adds the columns:
  data_frame.withColumn('Total_Sum', F.expr('col_name1 + col_name2 + .. + col_name_n'))
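
When there are many columns, the expression string can be generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical helper named build_sum_expr; it wraps each name in backticks (the Spark SQL identifier quote) and assumes no column name itself contains a backtick. The PySpark call in the trailing comment is illustrative and needs an existing data_frame.

```python
def build_sum_expr(columns):
    """Return a SQL expression string that adds the given columns.

    Hypothetical helper: each name is wrapped in backticks so that names
    with spaces or dots are still read as single identifiers.
    Assumes no name contains a backtick character.
    """
    columns = list(columns)
    if not columns:
        raise ValueError("build_sum_expr() needs at least one column name")
    return " + ".join(f"`{name}`" for name in columns)


# Show the generated expression for three example column names.
print(build_sum_expr(["col1", "col2", "col3"]))

# Assumed PySpark usage (requires an existing DataFrame `data_frame`):
# from pyspark.sql import functions as F
# data_frame = data_frame.withColumn(
#     'Total_Sum', F.expr(build_sum_expr(data_frame.columns)))
```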

说谎

2020-12-02 23:13

My problem was similar to the above but a bit more complex, as I had to add consecutive (cumulative) column sums as new columns in a PySpark dataframe. This approach uses code from Paul's Version 1 above:

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName('addColAsCumulativeSUM').getOrCreate()
df = spark.createDataFrame(
    data=[(1, 2, 3), (4, 5, 6), (3, 2, 1),
          (6, 1, -4), (0, 2, -2), (6, 4, 1),
          (4, 5, 2), (5, -3, -5), (6, 4, -1)],
    schema=['x1', 'x2', 'x3'])
df.show()

+---+---+---+
| x1| x2| x3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  3|  2|  1|
|  6|  1| -4|
|  0|  2| -2|
|  6|  4|  1|
|  4|  5|  2|
|  5| -3| -5|
|  6|  4| -1|
+---+---+---+

colnames=df.columns

Add new columns that are cumulative (consecutive) sums:

for i in range(0,len(colnames)):
    colnameLst= colnames[0:i+1]
    colname = 'cm'+ str(i+1)
    df = df.withColumn(colname, sum(df[col] for col in colnameLst))

df.show()

+---+---+---+---+---+---+
| x1| x2| x3|cm1|cm2|cm3|
+---+---+---+---+---+---+
|  1|  2|  3|  1|  3|  6|
|  4|  5|  6|  4|  9| 15|
|  3|  2|  1|  3|  5|  6|
|  6|  1| -4|  6|  7|  3|
|  0|  2| -2|  0|  2|  0|
|  6|  4|  1|  6| 10| 11|
|  4|  5|  2|  4|  9| 11|
|  5| -3| -5|  5|  2| -3|
|  6|  4| -1|  6| 10|  9|
+---+---+---+---+---+---+

The 'cumulative sum' columns added are as follows:

cm1 = x1
cm2 = x1 + x2
cm3 = x1 + x2 + x3
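
The same running totals can be produced in a single pass with itertools.accumulate instead of re-summing a growing slice on every iteration. This is a sketch of an alternative, not the code used above; running_sums is a hypothetical helper name, and it is checked here on the first row of the example table. The PySpark usage in the trailing comment is an assumption that relies on Column objects supporting the + operator.

```python
from itertools import accumulate
from operator import add


def running_sums(values):
    """Return [v1, v1+v2, v1+v2+v3, ...] for the given sequence."""
    return list(accumulate(values, add))


# First row of the example table: x1=1, x2=2, x3=3.
print(running_sums([1, 2, 3]))  # [1, 3, 6], i.e. cm1, cm2, cm3

# Assumed PySpark usage (requires `df` and `colnames` from above):
# columns = running_sums([df[name] for name in colnames])
# for i, column in enumerate(columns, start=1):
#     df = df.withColumn('cm' + str(i), column)
```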


广开言路

2020-12-02 23:14

df = spark.createDataFrame([("linha1", "valor1", 2), ("linha2", "valor2", 5)], ("Columna1", "Columna2", "Columna3"))

df.show()

+--------+--------+--------+
|Columna1|Columna2|Columna3|
+--------+--------+--------+
|  linha1|  valor1|       2|
|  linha2|  valor2|       5|
+--------+--------+--------+

df = df.withColumn('DivisaoPorDois', df[2]/2)
df.show()

+--------+--------+--------+--------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|
+--------+--------+--------+--------------+
|  linha1|  valor1|       2|           1.0|
|  linha2|  valor2|       5|           2.5|
+--------+--------+--------+--------------+

df = df.withColumn('Soma_Colunas', df[2]+df[3])
df.show()

+--------+--------+--------+--------------+------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|Soma_Colunas|
+--------+--------+--------+--------------+------------+
|  linha1|  valor1|       2|           1.0|         3.0|
|  linha2|  valor2|       5|           2.5|         7.5|
+--------+--------+--------+--------------+------------+


庸人自扰

2020-12-02 23:18
The most straightforward way of doing it is to use the expr function:
```
# Import only expr; a wildcard import would also shadow built-ins such as sum
from pyspark.sql.functions import expr
data = data.withColumn('total', expr("col1 + col2 + col3 + col4"))
```