Subset observations that differ by at least 30 minutes time

后端未结

关注

 3  681

I have a data.table (~30 million rows) consisting of a datetime column in POSIXct format, an id column and a few other co

相关标签:

3条回答

爱一瞬间的悲伤

2020-12-15 21:50

Using Rcpp:

library(Rcpp)
library(inline)
cppFunction(
  'LogicalVector selecttimes(const NumericVector x) {
   const int n = x.length();
   LogicalVector res(n);
   res(0) = true;
   double testval = x(0);
   for (int i=1; i<n; i++) {
    if (x(i) - testval > 30 * 60) {
      testval = x(i);
      res(i) = true;
    }
   }
   return res;
  }')

DT[, keep1 := selecttimes(datetime), by = id]

DT[, all(keep == keep1)]
#[1] TRUE

Some additional testing should be done, it needs input validation, and the time difference could be made a parameter.

0 讨论(0)

被撕碎了的回忆

2020-12-15 21:54

Here's what I would do:

setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data

s = 0L
w = DT[, .I[1L], by=id]$V1

while (length(w)){
   s = s + 1L
   DT[w, tag := s]

   m = DT[w, .(id, datetime = datetime+30*60)]
   w = DT[m, which = TRUE, roll=-Inf]
   w = w[!is.na(w)]
}

which gives

               datetime          x id  keep tag
 1: 2016-04-28 10:20:18 0.02461368  1  TRUE   1
 2: 2016-04-28 10:41:34 0.88953932  1 FALSE  NA
 3: 2016-04-28 10:46:07 0.31818101  1 FALSE  NA
 4: 2016-04-28 11:00:56 0.14711365  1  TRUE   2
 5: 2016-04-28 11:09:11 0.54406602  1 FALSE  NA
 6: 2016-04-28 11:39:09 0.69280341  1  TRUE   3
 7: 2016-04-28 11:50:01 0.99426978  1 FALSE  NA
 8: 2016-04-28 11:51:46 0.47779597  1 FALSE  NA
 9: 2016-04-28 11:57:58 0.23162579  1 FALSE  NA
10: 2016-04-28 11:58:23 0.96302423  1 FALSE  NA
11: 2016-04-28 10:13:19 0.21640794  2  TRUE   1
12: 2016-04-28 10:13:44 0.70853047  2 FALSE  NA
13: 2016-04-28 10:36:44 0.75845954  2 FALSE  NA
14: 2016-04-28 10:55:31 0.64050681  2  TRUE   2
15: 2016-04-28 11:00:33 0.90229905  2 FALSE  NA
16: 2016-04-28 11:11:51 0.28915974  2 FALSE  NA
17: 2016-04-28 11:14:14 0.79546742  2 FALSE  NA
18: 2016-04-28 11:26:17 0.69070528  2  TRUE   3
19: 2016-04-28 11:51:02 0.59414202  2 FALSE  NA
20: 2016-04-28 11:56:36 0.65570580  2  TRUE   4

The idea behind it is described by the OP in a comment:

per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on

0 讨论(0)

孤独总比滥情好

2020-12-15 22:06

# create an index column
DT[, idx := 1:.N, by = id]

# find the indices of the matching future dates
DT[, fut.idx := DT[.(id = id, datetime = datetime+30*60), on = c('id', 'datetime')
                    , idx, roll = -Inf]]
#               datetime          x id  keep         difftime idx  fut.idx
# 1: 2016-04-28 09:20:18 0.02461368  1  TRUE   0.0000000 mins   1        4
# 2: 2016-04-28 09:41:34 0.88953932  1 FALSE  21.2666667 mins   2        6
# 3: 2016-04-28 09:46:07 0.31818101  1 FALSE  25.8166667 mins   3        6
# 4: 2016-04-28 10:00:56 0.14711365  1  TRUE  40.6333333 mins   4        6
# 5: 2016-04-28 10:09:11 0.54406602  1 FALSE  48.8833333 mins   5        7
# 6: 2016-04-28 10:39:09 0.69280341  1  TRUE  78.8500000 mins   6       NA
# 7: 2016-04-28 10:50:01 0.99426978  1 FALSE  89.7166667 mins   7       NA
# 8: 2016-04-28 10:51:46 0.47779597  1 FALSE  91.4666667 mins   8       NA
# 9: 2016-04-28 10:57:58 0.23162579  1 FALSE  97.6666667 mins   9       NA
#10: 2016-04-28 10:58:23 0.96302423  1 FALSE  98.0833333 mins  10       NA
#11: 2016-04-28 09:13:19 0.21640794  2  TRUE   0.0000000 mins   1        4
#12: 2016-04-28 09:13:44 0.70853047  2 FALSE   0.4166667 mins   2        4
#13: 2016-04-28 09:36:44 0.75845954  2 FALSE  23.4166667 mins   3        6
#14: 2016-04-28 09:55:31 0.64050681  2  TRUE  42.2000000 mins   4        8
#15: 2016-04-28 10:00:33 0.90229905  2 FALSE  47.2333333 mins   5        9
#16: 2016-04-28 10:11:51 0.28915974  2 FALSE  58.5333333 mins   6        9
#17: 2016-04-28 10:14:14 0.79546742  2 FALSE  60.9166667 mins   7        9
#18: 2016-04-28 10:26:17 0.69070528  2  TRUE  72.9666667 mins   8       10
#19: 2016-04-28 10:51:02 0.59414202  2 FALSE  97.7166667 mins   9       NA
#20: 2016-04-28 10:56:36 0.65570580  2  TRUE 103.2833333 mins  10       NA


# at this point the problem is "solved", but you still have to extract the solution
# and that's the more complicated part
DT[, keep.new := FALSE]

# iterate over the matching indices (jumping straight to the correct one)
DT[, {
       next.idx = 1

       while(!is.na(next.idx)) {
         set(DT, .I[next.idx], 'keep.new', TRUE)
         next.idx = fut.idx[next.idx]
       }
     }, by = id]

DT[, identical(keep, keep.new)]
#[1] TRUE

Alternatively for the last step, you can do (this will iterate over the entire thing, but I don't know what the speed impact would be):

DT[, keep.3 := FALSE]
DT[DT[, .I[na.omit(Reduce(function(x, y) fut.idx[x], c(1, fut.idx), accumulate = T))]
      , by = id]$V1
   , keep.3 := TRUE]

DT[, identical(keep, keep.3)]
#[1] TRUE

0 讨论(0)