Adding a column of rowsums across a list of columns in Spark Dataframe

北恋 2020-12-14 18:03

I have a Spark dataframe with several columns. I want to add a column to the dataframe that is the sum of a certain subset of those columns.

For example, my data has an ID column followed by numeric columns var1 through var5, and I want a new column containing the row-wise sum of those variables.

4 Answers
  • 2020-12-14 18:18

    Plain and simple:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{lit, col}
    
    def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)
    
    val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col _)
    df.select(sum_(columnstosum: _*))
    

    with Python equivalent:

    from functools import reduce
    from operator import add
    from pyspark.sql.functions import lit, col
    
    def sum_(*cols):
        return reduce(add, cols, lit(0))
    
    columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
    select("*", sum_(*columnstosum))
    

    Both will evaluate to null if any value in the row is missing. You can use DataFrameNaFunctions.fill or the coalesce function to avoid that, as sketched below.
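
    A minimal sketch of the coalesce approach, assuming a DataFrame df with nullable columns var1 ... var5 (the column names are illustrative only):

    import org.apache.spark.sql.functions.{coalesce, col, lit}
    
    // Treat a null as 0 before summing, so one missing value does not null the whole row sum.
    val safeCols = Seq("var1", "var2", "var3", "var4", "var5").map(c => coalesce(col(c), lit(0)))
    val withSum = df.select(col("*"), safeCols.reduce(_ + _).alias("sums"))
    
    // Alternatively, replace nulls up front and reuse the plain sum:
    // val withSum = df.na.fill(0).withColumn("sums", Seq("var1", "var2", "var3", "var4", "var5").map(col _).reduce(_ + _))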

  • 2020-12-14 18:21

    Here's an elegant solution using Python:

    NewDF = OldDF.withColumn('sums', sum(OldDF[col] for col in OldDF.columns[1:]))
    

    Hopefully this will inspire something similar on the Scala side ... anyone? A rough translation is sketched below.
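
    For reference, a direct Scala translation might look like this (a sketch only, assuming a DataFrame oldDF whose first column is a non-numeric ID and whose remaining columns are all numeric):

    import org.apache.spark.sql.functions.col
    
    // Sum every column except the first one and append the result as "sums".
    val newDF = oldDF.withColumn("sums", oldDF.columns.drop(1).map(col(_)).reduce(_ + _))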

  • 2020-12-14 18:28

    Assuming you have a dataframe df, you can sum all columns except the ID column. This is helpful when you have many columns and don't want to list every name manually, as the answers above do.

    import org.apache.spark.sql.functions.col
    
    val sumAll = df.columns.collect { case x if x != "ID" => col(x) }.reduce(_ + _)
    df.withColumn("sum", sumAll)
    
  • 2020-12-14 18:41

    You should try the following:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._
    
    val sc: SparkContext = ...
    val sqlContext = new SQLContext(sc)
    
    import sqlContext.implicits._
    
    val input = sc.parallelize(Seq(
      ("a", 5, 7, 9, 12, 13),
      ("b", 6, 4, 3, 20, 17),
      ("c", 4, 9, 4, 6 , 9),
      ("d", 1, 2, 6, 8 , 1)
    )).toDF("ID", "var1", "var2", "var3", "var4", "var5")
    
    val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))
    
    val output = input.withColumn("sums", columnsToSum.reduce(_ + _))
    
    output.show()
    

    Then the result is:

    +---+----+----+----+----+----+----+
    | ID|var1|var2|var3|var4|var5|sums|
    +---+----+----+----+----+----+----+
    |  a|   5|   7|   9|  12|  13|  46|
    |  b|   6|   4|   3|  20|  17|  50|
    |  c|   4|   9|   4|   6|   9|  32|
    |  d|   1|   2|   6|   8|   1|  18|
    +---+----+----+----+----+----+----+
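
    On Spark 2.x and later the SQLContext boilerplate is usually replaced by SparkSession; a minimal sketch of the same example under that API (the app name and local master are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    
    val spark = SparkSession.builder().appName("rowSums").master("local[*]").getOrCreate()
    import spark.implicits._
    
    val input = Seq(
      ("a", 5, 7, 9, 12, 13),
      ("b", 6, 4, 3, 20, 17)
    ).toDF("ID", "var1", "var2", "var3", "var4", "var5")
    
    // Same reduction over the chosen columns, added as a "sums" column.
    val columnsToSum = List("var1", "var2", "var3", "var4", "var5").map(col(_))
    input.withColumn("sums", columnsToSum.reduce(_ + _)).show()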
    