I have a data.table (~30 million rows) consisting of a datetime column in POSIXct format, an id column and a few other co
Using Rcpp:
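library(data.table)
library(Rcpp)
cppFunction(
'LogicalVector selecttimes(const NumericVector x) {
  const int n = x.length();
  LogicalVector res(n);
  if (n == 0) return res;   // empty group: nothing to keep
  res(0) = true;            // the first row of each group is always kept
  double testval = x(0);    // timestamp of the most recently kept row
  for (int i = 1; i < n; i++) {
    // keep row i if it is more than 30 minutes (in seconds) after the last kept row
    if (x(i) - testval > 30 * 60) {
      testval = x(i);
      res(i) = true;
    }
  }
  return res;
}')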
DT[, keep1 := selecttimes(datetime), by = id]
DT[, all(keep == keep1)]
#[1] TRUE
Some additional testing should be done; the function still needs input validation (it assumes datetime is sorted within each id), and the time difference could be made a parameter.
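As a rough illustration of the last two points, here is a plain-R sketch of the same loop with the gap passed in and a couple of checks up front. The names select_times_r and gap_secs are made up for this example and are not part of the answer above; it will be much slower than the Rcpp version on 30 million rows, so it is mainly useful as a reference to test against.

```r
# Hypothetical plain-R reference version of selecttimes().
# gap_secs is an assumed parameter name; x must be sorted ascending.
select_times_r <- function(x, gap_secs = 30 * 60) {
  x <- as.numeric(x)                       # POSIXct becomes seconds since the epoch
  stopifnot(!anyNA(x), !is.unsorted(x), gap_secs >= 0)
  res <- logical(length(x))                # all FALSE to start with
  if (length(x) == 0L) return(res)         # empty group: nothing to keep
  res[1L] <- TRUE                          # first row is always kept
  testval <- x[1L]                         # time of the most recently kept row
  for (i in seq_along(x)[-1L]) {
    if (x[i] - testval > gap_secs) {
      testval <- x[i]
      res[i] <- TRUE
    }
  }
  res
}

# Offsets in seconds of the first six rows of id 1 in the example output
demo <- select_times_r(c(0, 1276, 1549, 2438, 2933, 4731))
demo
```

With those six offsets the first, fourth and sixth values come back TRUE, which agrees with the keep column for id 1.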
Here's what I would do:
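setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data
s = 0L
w = DT[, .I[1L], by=id]$V1            # the first row of each id is always kept
while (length(w)){
  s = s + 1L
  DT[w, tag := s]                     # tag the rows kept in this pass
  m = DT[w, .(id, datetime = datetime+30*60)]  # earliest allowed time of the next kept row
  w = DT[m, which = TRUE, roll=-Inf]  # first row at or after that time, per id
  w = w[!is.na(w)]                    # drop ids that have no later row
}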
which gives
               datetime          x id  keep tag
 1: 2016-04-28 10:20:18 0.02461368  1  TRUE   1
 2: 2016-04-28 10:41:34 0.88953932  1 FALSE  NA
 3: 2016-04-28 10:46:07 0.31818101  1 FALSE  NA
 4: 2016-04-28 11:00:56 0.14711365  1  TRUE   2
 5: 2016-04-28 11:09:11 0.54406602  1 FALSE  NA
 6: 2016-04-28 11:39:09 0.69280341  1  TRUE   3
 7: 2016-04-28 11:50:01 0.99426978  1 FALSE  NA
 8: 2016-04-28 11:51:46 0.47779597  1 FALSE  NA
 9: 2016-04-28 11:57:58 0.23162579  1 FALSE  NA
10: 2016-04-28 11:58:23 0.96302423  1 FALSE  NA
11: 2016-04-28 10:13:19 0.21640794  2  TRUE   1
12: 2016-04-28 10:13:44 0.70853047  2 FALSE  NA
13: 2016-04-28 10:36:44 0.75845954  2 FALSE  NA
14: 2016-04-28 10:55:31 0.64050681  2  TRUE   2
15: 2016-04-28 11:00:33 0.90229905  2 FALSE  NA
16: 2016-04-28 11:11:51 0.28915974  2 FALSE  NA
17: 2016-04-28 11:14:14 0.79546742  2 FALSE  NA
18: 2016-04-28 11:26:17 0.69070528  2  TRUE   3
19: 2016-04-28 11:51:02 0.59414202  2 FALSE  NA
20: 2016-04-28 11:56:36 0.65570580  2  TRUE   4
The idea behind it is described by the OP in a comment:
per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on
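To see how a rolling join finds "the next row to keep", here is a small sketch on made-up data (the table toy, its column t and the lookup table probe are invented for this illustration; times are plain seconds rather than POSIXct). With roll = -Inf, a lookup value that has no exact match is matched to the next later row in the same id, and the result is NA when there is no later row.

```r
library(data.table)

# Toy data: one id, times as offsets in seconds (illustrative only)
toy <- data.table(id = 1L,
                  t  = c(0, 1276, 1549, 2438, 2933, 4731),
                  key = c("id", "t"))

# Ask for the first row at or after "kept time + 30 minutes"
probe <- data.table(id = 1L, t = c(0, 2438, 4731) + 30 * 60)

# which = TRUE returns row numbers of toy instead of the joined rows
nxt <- toy[probe, which = TRUE, roll = -Inf]
nxt
```

Here the lookups from rows 1 and 4 land on rows 4 and 6, and the lookup from row 6 gives NA because nothing follows it within 30 minutes or later. That NA is what ends the loop for an id.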
# create an index column
DT[, idx := 1:.N, by = id]
# find the indices of the matching future dates
DT[, fut.idx := DT[.(id = id, datetime = datetime+30*60), on = c('id', 'datetime')
                   , idx, roll = -Inf]]
#               datetime          x id  keep         difftime idx fut.idx
# 1: 2016-04-28 09:20:18 0.02461368  1  TRUE   0.0000000 mins   1       4
# 2: 2016-04-28 09:41:34 0.88953932  1 FALSE  21.2666667 mins   2       6
# 3: 2016-04-28 09:46:07 0.31818101  1 FALSE  25.8166667 mins   3       6
# 4: 2016-04-28 10:00:56 0.14711365  1  TRUE  40.6333333 mins   4       6
# 5: 2016-04-28 10:09:11 0.54406602  1 FALSE  48.8833333 mins   5       7
# 6: 2016-04-28 10:39:09 0.69280341  1  TRUE  78.8500000 mins   6      NA
# 7: 2016-04-28 10:50:01 0.99426978  1 FALSE  89.7166667 mins   7      NA
# 8: 2016-04-28 10:51:46 0.47779597  1 FALSE  91.4666667 mins   8      NA
# 9: 2016-04-28 10:57:58 0.23162579  1 FALSE  97.6666667 mins   9      NA
#10: 2016-04-28 10:58:23 0.96302423  1 FALSE  98.0833333 mins  10      NA
#11: 2016-04-28 09:13:19 0.21640794  2  TRUE   0.0000000 mins   1       4
#12: 2016-04-28 09:13:44 0.70853047  2 FALSE   0.4166667 mins   2       4
#13: 2016-04-28 09:36:44 0.75845954  2 FALSE  23.4166667 mins   3       6
#14: 2016-04-28 09:55:31 0.64050681  2  TRUE  42.2000000 mins   4       8
#15: 2016-04-28 10:00:33 0.90229905  2 FALSE  47.2333333 mins   5       9
#16: 2016-04-28 10:11:51 0.28915974  2 FALSE  58.5333333 mins   6       9
#17: 2016-04-28 10:14:14 0.79546742  2 FALSE  60.9166667 mins   7       9
#18: 2016-04-28 10:26:17 0.69070528  2  TRUE  72.9666667 mins   8      10
#19: 2016-04-28 10:51:02 0.59414202  2 FALSE  97.7166667 mins   9      NA
#20: 2016-04-28 10:56:36 0.65570580  2  TRUE 103.2833333 mins  10      NA
# at this point the problem is "solved", but you still have to extract the solution
# and that's the more complicated part
DT[, keep.new := FALSE]
# iterate over the matching indices (jumping straight to the correct one)
DT[, {
  next.idx = 1
  while(!is.na(next.idx)) {
    set(DT, .I[next.idx], 'keep.new', TRUE)
    next.idx = fut.idx[next.idx]
  }
}, by = id]
DT[, identical(keep, keep.new)]
#[1] TRUE
Alternatively, the last step can be done with Reduce (this iterates once per row of each group instead of jumping between kept rows; I don't know what the speed impact would be):
DT[, keep.3 := FALSE]
DT[DT[, .I[na.omit(Reduce(function(x, y) fut.idx[x], c(1, fut.idx), accumulate = TRUE))]
      , by = id]$V1
   , keep.3 := TRUE]
DT[, identical(keep, keep.3)]
#[1] TRUE
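The chain-following in both extraction steps can be checked by hand on the fut.idx values printed for id 1 in the table above. This is only a sketch on a plain vector (kept_loop and kept_reduce are names made up here), not a replacement for the data.table code:

```r
# fut.idx for id 1, copied from the printed table
fut.idx <- c(4, 6, 6, 6, 7, NA, NA, NA, NA, NA)

# Variant 1: jump from kept row to kept row until there is no next one
kept_loop <- numeric(0)
next.idx <- 1
while (!is.na(next.idx)) {
  kept_loop <- c(kept_loop, next.idx)
  next.idx <- fut.idx[next.idx]
}

# Variant 2: the Reduce form; y is ignored and only sets the number of steps,
# so the chain is padded with NA once it runs out
kept_reduce <- as.vector(na.omit(
  Reduce(function(x, y) fut.idx[x], c(1, fut.idx), accumulate = TRUE)
))

kept_loop
kept_reduce
```

Both variants give rows 1, 4 and 6, matching keep for id 1. The while loop takes one step per kept row, whereas Reduce always takes one step per row in the group, which is the trade-off mentioned above.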