Question
I have a Spark DataFrame in which I want to calculate a running total of amount: each row's amount plus the cumulative sum of amount from the previous rows, within each group, ordered by date. Here is the df:
import findspark
findspark.init()

import pandas as pd
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

data1 = {'date': {0: '2018-04-03', 1: '2018-04-04', 2: '2018-04-05', 3: '2018-04-06', 4: '2018-04-07'},
         'id': {0: 'id1', 1: 'id2', 2: 'id1', 3: 'id3', 4: 'id2'},
         'group': {0: '1', 1: '1', 2: '1', 3: '2', 4: '1'},
         'amount': {0: 50, 1: 40, 2: 50, 3: 55, 4: 20}}

df1_pd = pd.DataFrame(data1, columns=data1.keys())
df1 = spark.createDataFrame(df1_pd)
df1.show()
+----------+---+-----+------+
|      date| id|group|amount|
+----------+---+-----+------+
|2018-04-03|id1|    1|    50|
|2018-04-04|id2|    1|    40|
|2018-04-05|id1|    1|    50|
|2018-04-06|id3|    2|    55|
|2018-04-07|id2|    1|    20|
+----------+---+-----+------+
The output I am looking for:
+----------+---+-----+------+---+
|      date| id|group|amount|sum|
+----------+---+-----+------+---+
|2018-04-03|id1|    1|    50| 50|
|2018-04-04|id2|    1|    40| 90|
|2018-04-05|id1|    1|    50|140|
|2018-04-06|id3|    2|    55| 55|
|2018-04-07|id2|    1|    20|160|
+----------+---+-----+------+---+
Answer 1:
Window definition:
from pyspark.sql.window import Window
from pyspark.sql.functions import sum  # note: shadows the built-in sum in this scope

w = Window.partitionBy("group").orderBy("date").rowsBetween(
    Window.unboundedPreceding,  # take all rows from the beginning of the partition
    Window.currentRow           # up to and including the current row
)
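As a side note (not part of the original answer): if the explicit rowsBetween is omitted, Spark falls back to the default RANGE frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), so rows that tie on date within a group would all receive the same cumulative sum. A minimal sketch contrasting the two specs:

# Default frame when only orderBy is given: RANGE BETWEEN UNBOUNDED
# PRECEDING AND CURRENT ROW -- ties on "date" share one cumulative value.
w_range = Window.partitionBy("group").orderBy("date")

# Explicit ROWS frame: every row gets its own running total, even
# when several rows share the same "date".
w_rows = (Window.partitionBy("group").orderBy("date")
          .rowsBetween(Window.unboundedPreceding, Window.currentRow))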
Sum:
(df1.withColumn("sum", sum("amount").over(w))
    .orderBy("date")  # sort for easy inspection; not necessary
    .show())
Result:
+----------+---+-----+------+---+
|      date| id|group|amount|sum|
+----------+---+-----+------+---+
|2018-04-03|id1|    1|    50| 50|
|2018-04-04|id2|    1|    40| 90|
|2018-04-05|id1|    1|    50|140|
|2018-04-06|id3|    2|    55| 55|
|2018-04-07|id2|    1|    20|160|
+----------+---+-----+------+---+
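For completeness, the same running total can be written in Spark SQL; this variant is not from the original answer, and the temp view name t is only illustrative:

df1.createOrReplaceTempView("t")

spark.sql("""
    SELECT *,
           SUM(amount) OVER (
               PARTITION BY `group`
               ORDER BY date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS `sum`
    FROM t
    ORDER BY date
""").show()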
Source: https://stackoverflow.com/questions/50144313/adding-previous-row-with-current-row-using-window-function