Cumulative sum in Spark

Asked 2020-12-09 22:53 by 暖寄归人

I want to compute a cumulative sum in Spark. Here is the registered table (input):

+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|            ...|                ...| ...| ...| ...|
+---------------+-------------------+----+----+----+

1 Answer
  • 2020-12-09 23:29

    To get the cumulative sum using the DataFrame API, use a window frame defined with the rowsBetween method. In Spark 2.1 and newer, create the window as follows:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum
    import spark.implicits._  // provides the $"colName" column syntax

    val w = Window.partitionBy($"product_id", $"ack")
      .orderBy($"date_time")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)


    This tells Spark to sum the values from the beginning of the partition up to and including the current row. On older versions of Spark, use rowsBetween(Long.MinValue, 0) for the same effect, as in the sketch below.
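
    A minimal sketch of the pre-2.1 variant (same partitioning and ordering; only the frame bounds differ, and the name wOld is illustrative):

    val wOld = Window.partitionBy($"product_id", $"ack")
      .orderBy($"date_time")
      .rowsBetween(Long.MinValue, 0)  // Long.MinValue = unbounded preceding, 0 = current row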

    Then apply the window when computing the sums:

    val newDf = inputDF.withColumn("val_sum", sum($"val1").over(w))
      .withColumn("val2_sum", sum($"val2").over(w))
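
    For reference, the same running sums can be expressed in Spark SQL. A minimal sketch, assuming the DataFrame is registered as a temp view (the view name input and the output column names val1_sum/val2_sum are illustrative):

    inputDF.createOrReplaceTempView("input")
    val sqlDf = spark.sql("""
      SELECT *,
             SUM(val1) OVER (PARTITION BY product_id, ack ORDER BY date_time
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS val1_sum,
             SUM(val2) OVER (PARTITION BY product_id, ack ORDER BY date_time
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS val2_sum
      FROM input
    """)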
    