Adding a column of rowsums across a list of columns in Spark Dataframe

北恋 2020-12-14 18:03

I have a Spark dataframe with several columns. I want to add a column on to the dataframe that is a sum of a certain number of the columns.

For example, my data l

4 Answers
  •  执笔经年
    2020-12-14 18:18

    Plain and simple:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{lit, col}
    
    def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)
    
    val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col _)
    df.select(sum_(columnstosum: _*))
    

    with Python equivalent:

    from functools import reduce
    from operator import add
    from pyspark.sql.functions import lit, col
    
    def sum_(*cols):
        return reduce(add, cols, lit(0))
    
    columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
    df.select("*", sum_(*columnstosum))
    

    Both will produce null for a row if any of the summed values in that row is null. You can use DataFrameNaFunctions.fill or the coalesce function to avoid that.
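
    A minimal sketch of that null handling in PySpark (the toy data and column names here are hypothetical): wrapping each column in coalesce(..., lit(0)) makes nulls count as zero in the row sum.

    ```python
    from functools import reduce
    from operator import add

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import coalesce, col, lit

    spark = SparkSession.builder.master("local[1]").appName("rowsum").getOrCreate()

    # Toy data with a missing value in var2 of the first row.
    df = spark.createDataFrame(
        [(1, None, 3), (4, 5, 6)],
        ["var1", "var2", "var3"],
    )

    # coalesce(col, lit(0)) substitutes 0 for null before summing,
    # so the row sum is never poisoned by a single missing value.
    cols = [coalesce(col(c), lit(0)) for c in ["var1", "var2", "var3"]]
    result = df.withColumn("rowsum", reduce(add, cols, lit(0)))
    result.show()
    ```

    Without the coalesce wrapper the first row's sum would be null; with it, the two row sums are 4 and 15. Alternatively, df.na.fill(0) before summing achieves the same effect for numeric columns.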
