Question
I have a DataFrame with the following columns and data:
+---+----------+---------------+---------------+---------------+------+----------+
| id|  priority|         status|       datetime|data_as_of_Date|Amount|open_close|
+---+----------+---------------+---------------+---------------+------+----------+
|  1|Unassigned|          Fixed| 10/8/2019 0:00| 2/12/2020 0:00|    40|    Closed|
|  1|Unassigned|            New|2/12/2019 11:00| 2/12/2020 0:00|    20|      Open|
|  1|Unassigned|Fix in progress|9/12/2019 11:00| 2/12/2020 0:00|    90|      Open|
|  3|  Critical|        Removed|5/17/2019 12:00| 2/12/2020 0:00|    33|    Closed|
|  3|Unassigned|Fix in progress|5/26/2019 10:00| 2/12/2020 0:00|    30|      Open|
|  3|  Critical|            New|  5/8/2019 3:00| 2/12/2020 0:00|    34|      Open|
|  3|Unassigned|          Fixed| 7/29/2019 7:00| 2/12/2020 0:00|    29|    Closed|
+---+----------+---------------+---------------+---------------+------+----------+
How can I count how many times the value of the open_close column
changes for each company (identified by id)?
Answer 1:
You can use a window function to add a row number within each id, ordered by your date column. Then use the lag function to create a new column that holds the open_close value of the previous row, and mark each row with 1 if its open_close value differs from the previous one and 0 otherwise. Finally, group by company id and sum the rows marked 1.
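import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// The 'col symbol syntax needs spark.implicits._ in scope (automatic in spark-shell)
val df2 = df.withColumn("row_num", row_number().over(Window.partitionBy('id).orderBy('datetime)))
// Previous open_close within each id; null on the first row of each id, which counts as no change
val df3 = df2.select('*, lag('open_close, 1).over(Window.partitionBy('id).orderBy('row_num)).as("lag"))
val df4 = df3.select('*, when('lag.isNull || 'open_close === 'lag, 0).otherwise(1).as("change"))
df4.groupBy('id).agg(sum('change)).show()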
+---+-----------+
| id|sum(change)|
+---+-----------+
|  1|          1|
|  3|          2|
+---+-----------+
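One thing to watch: the question does not say what type the datetime column has. If it is a string, the window orders rows lexicographically, so "5/8/2019 3:00" sorts after "5/26/2019 10:00", and that ordering is what produces the count of 2 for id 3 above. The sketch below is one way to order chronologically instead. It assumes datetime is a string in month/day/year order (the pattern "M/d/yyyy H:mm" is my guess from the sample rows), and the names events, ts, prev and changes are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("open-close-changes")
  .getOrCreate()
import spark.implicits._

// Only the columns needed for the count, copied from the question's sample data.
val events = Seq(
  (1, "10/8/2019 0:00",  "Closed"),
  (1, "2/12/2019 11:00", "Open"),
  (1, "9/12/2019 11:00", "Open"),
  (3, "5/17/2019 12:00", "Closed"),
  (3, "5/26/2019 10:00", "Open"),
  (3, "5/8/2019 3:00",   "Open"),
  (3, "7/29/2019 7:00",  "Closed")
).toDF("id", "datetime", "open_close")

// Assumption: datetime is a month/day/year string; parse it so the
// window orders rows chronologically rather than as text.
val parsed = events.withColumn("ts", to_timestamp($"datetime", "M/d/yyyy H:mm"))

val byId = Window.partitionBy($"id").orderBy($"ts")

val changes = parsed
  // Previous open_close within the same id; null on the first row of each id.
  .withColumn("prev", lag($"open_close", 1).over(byId))
  // Count a change only when a previous value exists and differs.
  .withColumn("change",
    when($"prev".isNotNull && $"prev" =!= $"open_close", 1).otherwise(0))
  .groupBy($"id")
  .agg(sum($"change").as("changes"))

changes.orderBy($"id").show()
```

With the rows in chronological order, id 3 goes Open, Closed, Open, Closed, so it should come out as 3 changes rather than 2, while id 1 stays at 1. If your datetime column is already a timestamp, the parsing step is unnecessary and you can order the window by it directly.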
Source: https://stackoverflow.com/questions/61687176/how-to-count-record-changes-for-a-particular-value-of-a-column-in-a-scala-datafr