Question
I have the following sample DataFrame:
rdd = sc.parallelize([(1,20), (2,30), (3,30)])
df2 = spark.createDataFrame(rdd, ["id", "duration"])
df2.show()
+---+--------+
| id|duration|
+---+--------+
| 1| 20|
| 2| 30|
| 3| 30|
+---+--------+
I want to sort this DataFrame in desc order of duration and add a new column which has the cumulative sum of the duration. So I did the following:
from pyspark.sql import Window
from pyspark.sql.functions import sum

windowSpec = Window.orderBy(df2['duration'].desc())
df_cum_sum = df2.withColumn("duration_cum_sum", sum('duration').over(windowSpec))
df_cum_sum.show()
+---+--------+----------------+
| id|duration|duration_cum_sum|
+---+--------+----------------+
| 2| 30| 60|
| 3| 30| 60|
| 1| 20| 80|
+---+--------+----------------+
My desired output is:
+---+--------+----------------+
| id|duration|duration_cum_sum|
+---+--------+----------------+
| 2| 30| 30|
| 3| 30| 60|
| 1| 20| 80|
+---+--------+----------------+
How do I get this?
Here is the breakdown:
+--------+----------------+
|duration|duration_cum_sum|
+--------+----------------+
| 30| 30| #First value
| 30| 60| #Current duration + previous cum sum value
| 20| 80| #Current duration + previous cum sum value
+--------+----------------+
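The breakdown above is just a running (prefix) sum over the durations once they are sorted in descending order, which can be sanity-checked in plain Python:

```python
from itertools import accumulate

# Durations after sorting in descending order
durations = [30, 30, 20]

# Running sum: each element is the current duration plus the previous total
print(list(accumulate(durations)))  # [30, 60, 80]
```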
Answer 1:
You can introduce row_number to break the ties; written in SQL:
df2.selectExpr(
    "id", "duration",
    "sum(duration) over (order by row_number() over (order by duration desc)) as duration_cum_sum"
).show()
+---+--------+----------------+
| id|duration|duration_cum_sum|
+---+--------+----------------+
| 2| 30| 30|
| 3| 30| 60|
| 1| 20| 80|
+---+--------+----------------+
Answer 2:
Another way is to make the window frame explicit with rowsBetween:

from pyspark.sql import Window
from pyspark.sql import functions as F

df2.withColumn(
    'cumu',
    F.sum('duration').over(
        Window.orderBy(F.col('duration').desc())
              .rowsBetween(Window.unboundedPreceding, 0)
    )
).show()
Source: https://stackoverflow.com/questions/46979685/calculating-cumulative-sum-in-pyspark-using-window-functions