Question
I have the following data.frame:
grp nr yr
1: A 1.0 2009
2: A 2.0 2009
3: A 1.5 2009
4: A 1.0 2010
5: B 3.0 2009
6: B 2.0 2010
7: B NA 2011
8: C 3.0 2014
9: C 3.0 2019
10: C 3.0 2020
11: C 4.0 2021
Desired output:
grp nr yr nr_roll_period_3
1 A 1.0 2009 NA
2 A 2.0 2009 NA
3 A 1.5 2009 NA
4 A 1.0 2010 NA
5 B 3.0 2009 NA
6 B 2.0 2010 NA
7 B NA 2011 NA
8 C 3.0 2014 NA
9 C 3.0 2019 NA
10 C 3.0 2020 NA
11 C 4.0 2021 3.333333
The logic:
- I want to calculate a rolling mean for the period of length k (let's say 3), where 3 includes the current month/year/day (by group)
- However, it shouldn't calculate anything where there are not 3 consecutive years/months/days
- Likewise, whenever there is NA in the column for calculation within this period, the output should be NA.
Currently I have this function:
calculate_rolling_window <- function(dt, date_col, calc_col, id, k) {
  require(data.table)
  return(setDT(dt)[
    , paste(calc_col, "roll_period", k, sep = "_") :=
      sapply(get(date_col), function(x) mean(get(calc_col)[between(get(date_col), x - k + 1, x)])),
    by = mget(id)])
}
It works fine for the regular cases, where there are no duplicates in the date column. However, with duplicates it fails:
grp nr yr nr_roll_period_3
1: A 1.0 2009 1.500000
2: A 2.0 2009 1.500000
3: A 1.5 2009 1.500000
4: A 1.0 2010 1.375000
5: B 3.0 2009 NA
6: B 2.0 2010 NA
7: B NA 2011 NA
8: C 3.0 2014 NA
9: C 3.0 2019 NA
10: C 3.0 2020 NA
11: C 4.0 2021 3.333333
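To illustrate where the duplicates come in, here is a minimal sketch using only the group A rows from the sample data as plain vectors (the variable names are just for illustration). Selecting rows by date range counts rows, not distinct years, so the window ending in 2010 picks up four rows although it only covers two different years:

```r
# Group A from the question, as plain vectors
yr <- c(2009, 2009, 2009, 2010)
nr <- c(1, 2, 1.5, 1)
k  <- 3

# Window ending in 2010, i.e. the years 2008..2010
in_window <- yr >= 2010 - k + 1 & yr <= 2010

rows_in_window  <- sum(in_window)                 # 4 rows fall into the window
years_in_window <- length(unique(yr[in_window]))  # but only 2 distinct years
window_mean     <- mean(nr[in_window])            # 1.375, as in row 4 above
```

So a check on the number of distinct years in the window, rather than on the number of rows, seems to be what is missing.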
Any ideas on how to handle this? No need for exclusively data.table approach.
Answer 1:
This can be solved by grouping in a non-equi join to aggregate over a rolling window of length k, filtering for k consecutive years, and an update join:
library(data.table)
k <- 3L
# group by join parameters of a non-equi join
mDT <- setDT(DT)[.(grp = grp, upper = yr, lower = yr - k),
                 on = .(grp, yr <= upper, yr > lower),
                 .(uniqueN(x.yr), mean(nr)), by = .EACHI]
# update join with filtered intermediate result
DT[mDT[V1 == k], on = .(grp, yr), paste0("nr_roll_period_", k) := V2]
DT
which returns OP's expected result:
    grp  nr   yr nr_roll_period_3
 1:   A 1.0 2009               NA
 2:   A 2.0 2009               NA
 3:   A 1.5 2009               NA
 4:   A 1.0 2010               NA
 5:   B 3.0 2009               NA
 6:   B 2.0 2010               NA
 7:   B  NA 2011               NA
 8:   C 3.0 2014               NA
 9:   C 3.0 2019               NA
10:   C 3.0 2020               NA
11:   C 4.0 2021         3.333333
The intermediate result mDT contains the rolling mean V2 over k periods and the count of unique/distinct years V1 within each period. It is created by a non-equi join of DT with a data.table containing the upper and lower bounds which is created on-the-fly by .(grp = grp, upper = yr, lower = yr - k).
mDT
    grp   yr   yr V1       V2
 1:   A 2009 2006  1 1.500000
 2:   A 2009 2006  1 1.500000
 3:   A 2009 2006  1 1.500000
 4:   A 2010 2007  2 1.375000
 5:   B 2009 2006  1 3.000000
 6:   B 2010 2007  2 2.500000
 7:   B 2011 2008  3       NA
 8:   C 2014 2011  1 3.000000
 9:   C 2019 2016  1 3.000000
10:   C 2020 2017  2 3.000000
11:   C 2021 2018  3 3.333333
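To see what the join operates on, the table of bounds can also be built on its own. This is only a sketch based on the sample data; the names bounds and matched are illustrative and do not appear in the code above:

```r
library(data.table)

k <- 3L
DT <- data.table(
  grp = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
  nr  = c(1, 2, 1.5, 1, 3, 2, NA, 3, 3, 3, 4),
  yr  = c(2009L, 2009L, 2009L, 2010L, 2009L, 2010L, 2011L,
          2014L, 2019L, 2020L, 2021L)
)

# One row of bounds per row of DT: the window is lower < yr <= upper,
# i.e. the k years ending with the current year
bounds <- DT[, .(grp = grp, upper = yr, lower = yr - k)]

# Rows of DT that fall into the window of the last row (grp C, 2018 < yr <= 2021)
matched <- DT[grp == "C" & yr <= 2021L & yr > 2018L]
```

For the last row the window holds the years 2019, 2020 and 2021 of group C, which is why V1 is 3 in row 11 of mDT.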
This is filtered for rows which contain exactly k distinct years:
mDT[V1 == k]
   grp   yr   yr V1       V2
1:   B 2011 2008  3       NA
2:   C 2021 2018  3 3.333333
Finally, this is joined with DT to append the new column to DT.
Note that mean() returns NA by default (na.rm = FALSE) if there is an NA in the input data.
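For comparison, the same rules can also be sketched without a join, closer to the sapply() approach from the question. This is not the join-based method above but a hypothetical helper (roll_mean_consecutive is a made-up name). It assumes the period column holds whole numbers without NA, so that k distinct values inside a window of width k means k consecutive periods, and it relies on mean() propagating NA as noted:

```r
library(data.table)

# Hypothetical helper: rolling mean over the k periods ending at each row.
# Returns NA unless all k periods of the window occur in the data.
# Assumes `period` holds whole numbers (e.g. years) and contains no NA.
roll_mean_consecutive <- function(value, period, k) {
  sapply(period, function(x) {
    in_window <- period >= x - k + 1 & period <= x
    if (length(unique(period[in_window])) == k) {
      mean(value[in_window])  # NA if any value in the window is NA
    } else {
      NA_real_
    }
  })
}

DT <- data.table(
  grp = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
  nr  = c(1, 2, 1.5, 1, 3, 2, NA, 3, 3, 3, 4),
  yr  = c(2009L, 2009L, 2009L, 2010L, 2009L, 2010L, 2011L,
          2014L, 2019L, 2020L, 2021L)
)

# Apply the helper within each group
DT[, nr_roll_period_3 := roll_mean_consecutive(nr, yr, 3L), by = grp]
```

On the sample data only the last row (group C, 2021) gets a value. Row 7 (group B, 2011) does cover three consecutive years but stays NA because of the NA in nr. This loops over every row within each group, so for large data the non-equi join is probably the better choice.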
Data
library(data.table)
DT <- fread(text = "rn grp nr yr
1: A 1.0 2009
2: A 2.0 2009
3: A 1.5 2009
4: A 1.0 2010
5: B 3.0 2009
6: B 2.0 2010
7: B NA 2011
8: C 3.0 2014
9: C 3.0 2019
10: C 3.0 2020
11: C 4.0 2021", drop = 1L)
Source: https://stackoverflow.com/questions/52725965/rolling-window-function-for-irregular-time-series-that-can-handle-duplicates