I want to group a vector into consecutive runs of elements whose sum is less than or equal to n. Assume the following:
set.seed(1)
x <- sample(10, 20, replace = TRUE)
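To make the intended rule concrete, here is a minimal plain-R sketch (group_leq is a hypothetical helper name, not part of the original post): a new group starts whenever adding the next element would push the current group's sum above n.

group_leq <- function(x, n) {
  # start a new group whenever adding x[i] would push the running sum above n
  # (an element larger than n ends up alone in its own group)
  g <- integer(length(x)); s <- 0; gi <- 1L
  for (i in seq_along(x)) {
    s <- s + x[i]
    if (s > n) {
      gi <- gi + 1L
      s <- x[i]
    }
    g[i] <- gi
  }
  g
}

With the x sampled above and n = 15, this should reproduce the grouping y used below.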
The following works, but can probably be improved:
x <- c(3L, 4L, 6L, 10L, 3L, 9L, 10L, 7L, 7L, 1L, 3L, 2L, 7L, 4L, 8L, 5L, 8L, 10L, 4L, 8L)
y <- as.integer(c(1, 1, 1, 2, 2, 3, 4, 5 ,5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10))
n = 15
library(data.table)
DT = data.table(x, y)
DT[, xc := cumsum(x)]

# for each row z, find the first row whose cumulative sum exceeds xc[z-1] + n,
# i.e. where the next group would start if a group started at row z
b = DT[.(shift(xc, fill=0) + n + 1), on=.(xc), roll=-Inf, which=TRUE]

# walk b from the first row, marking every group start
z = 1; res = logical(length(x))
while (!is.na(z) && z <= length(x)){
  res[z] <- TRUE
  z <- b[z]
}

# group number = number of group starts seen so far
DT[, g := cumsum(res)]
DT[]  # print the result
x y xc g
1: 3 1 3 1
2: 4 1 7 1
3: 6 1 13 1
4: 10 2 23 2
5: 3 2 26 2
6: 9 3 35 3
7: 10 4 45 4
8: 7 5 52 5
9: 7 5 59 5
10: 1 5 60 5
11: 3 6 63 6
12: 2 6 65 6
13: 7 6 72 6
14: 4 7 76 7
15: 8 7 84 7
16: 5 8 89 8
17: 8 8 97 8
18: 10 9 107 9
19: 4 9 111 9
20: 8 10 119 10
DT[, all(y == g)] # TRUE
How it works
The rolling join asks "if this is the start of a group, where will the next one start?" Then you can iterate over the result, starting from the first position, to find all the groups.
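For the data above, b should come out as (values are row numbers in DT; NA once there is no later start):

b
#  [1]  4  4  4  6  7  7  8 11 13 14 14 15 15 16 17 18 18 20 NA NA

Starting at z = 1 and repeatedly jumping to b[z] visits rows 1, 4, 6, 7, 8, 11, 14, 16, 18 and 20, which are exactly the group starts, and cumsum(res) then turns those marks into the group numbers in g.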
The last line, DT[, g := cumsum(res)], could also be done as a rolling join (maybe faster?):
DT[, g := data.table(r = which(res))[, g := .I][.(.I), on=.(r), roll=TRUE, x.g]]
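Here data.table(r = which(res))[, g := .I] builds a small lookup table of group start rows and their group numbers, .(.I) supplies the row numbers 1..nrow(DT) to join on, roll=TRUE carries each group number forward from its start row, and x.g picks that carried-forward group number from the lookup table. Assuming the res computed above, the lookup table should look roughly like:

data.table(r = which(res))[, g := .I][]  # trailing [] so the result prints after :=
#      r  g
#  1:  1  1
#  2:  4  2
#  3:  6  3
#  4:  7  4
#  5:  8  5
#  6: 11  6
#  7: 14  7
#  8: 16  8
#  9: 18  9
# 10: 20 10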