I have a data.table (~30 million rows) consisting of a datetime column in POSIXct format, an id column and a few other co
Here's what I would do:
setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data
s = 0L
w = DT[, .I[1L], by=id]$V1
while (length(w)){
s = s + 1L
DT[w, tag := s]
m = DT[w, .(id, datetime = datetime+30*60)]
w = DT[m, which = TRUE, roll=-Inf]
w = w[!is.na(w)]
}
which gives
datetime x id keep tag
1: 2016-04-28 10:20:18 0.02461368 1 TRUE 1
2: 2016-04-28 10:41:34 0.88953932 1 FALSE NA
3: 2016-04-28 10:46:07 0.31818101 1 FALSE NA
4: 2016-04-28 11:00:56 0.14711365 1 TRUE 2
5: 2016-04-28 11:09:11 0.54406602 1 FALSE NA
6: 2016-04-28 11:39:09 0.69280341 1 TRUE 3
7: 2016-04-28 11:50:01 0.99426978 1 FALSE NA
8: 2016-04-28 11:51:46 0.47779597 1 FALSE NA
9: 2016-04-28 11:57:58 0.23162579 1 FALSE NA
10: 2016-04-28 11:58:23 0.96302423 1 FALSE NA
11: 2016-04-28 10:13:19 0.21640794 2 TRUE 1
12: 2016-04-28 10:13:44 0.70853047 2 FALSE NA
13: 2016-04-28 10:36:44 0.75845954 2 FALSE NA
14: 2016-04-28 10:55:31 0.64050681 2 TRUE 2
15: 2016-04-28 11:00:33 0.90229905 2 FALSE NA
16: 2016-04-28 11:11:51 0.28915974 2 FALSE NA
17: 2016-04-28 11:14:14 0.79546742 2 FALSE NA
18: 2016-04-28 11:26:17 0.69070528 2 TRUE 3
19: 2016-04-28 11:51:02 0.59414202 2 FALSE NA
20: 2016-04-28 11:56:36 0.65570580 2 TRUE 4
The idea behind it is described by the OP in a comment:
per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on