I want to compute a cumulative sum in Spark. Here is the input table:
+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|            ...|                ...| ...| ...| ...|
+---------------+-------------------+----+----+----+
To get a cumulative sum with the DataFrame API, define a window whose frame runs
from the start of the partition to the current row, using rowsBetween. In Spark 2.1
and newer, create the window as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum  // plus spark.implicits._ for the $"col" syntax

val w = Window.partitionBy($"product_id", $"ack")
  .orderBy($"date_time")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
This tells Spark to aggregate everything from the beginning of the partition up to
and including the current row. In versions older than 2.1, where these constants are
not available, use rowsBetween(Long.MinValue, 0) for the same effect.
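For example, on Spark 2.0 the equivalent window definition would look like this
(wLegacy is just an illustrative name, not from the original post):

// Spark < 2.1: Long.MinValue marks the unbounded start of the frame, 0 the current row.
val wLegacy = Window.partitionBy($"product_id", $"ack")
  .orderBy($"date_time")
  .rowsBetween(Long.MinValue, 0)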
To use the window, apply sum over it with withColumn:
val newDf = inputDF
  .withColumn("val1_sum", sum($"val1").over(w))
  .withColumn("val2_sum", sum($"val2").over(w))