Subset observations that differ by at least 30 minutes in time

忘掉有多难 2020-12-15 21:16

I have a data.table (~30 million rows) consisting of a datetime column in POSIXct format, an id column and a few other co

3 Answers
  • 2020-12-15 21:50

    Using Rcpp:

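    library(Rcpp)

    # Greedy scan over sorted timestamps (in seconds): keep the first element,
    # then keep each element more than 30 minutes after the last kept one.
    cppFunction(
      'LogicalVector selecttimes(const NumericVector x) {
       const int n = x.length();
       LogicalVector res(n);
       if (n == 0) return res;
       res(0) = true;
       double testval = x(0);
       for (int i=1; i<n; i++) {
        if (x(i) - testval > 30 * 60) {
          testval = x(i);
          res(i) = true;
        }
       }
       return res;
      }')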
    
    DT[, keep1 := selecttimes(datetime), by = id]
    
    DT[, all(keep == keep1)]
    #[1] TRUE
    

    Some additional testing should be done: the function needs input validation, and the time difference could be made a parameter.
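
    Those two suggestions could look roughly like the following in plain R. This is only a sketch, not the answerer's code: select_times and gap_secs are hypothetical names, the loop mirrors the Rcpp function above (strictly more than the gap), and the sample timestamps are those of id 2 with UTC assumed. An interpreted loop will likely be much slower than the compiled version on 30 million rows, so it is meant for illustration.

    ```r
    # Hypothetical helper: pure-R version of the greedy scan, with the minimum
    # gap (in seconds) exposed as a parameter and basic input checks added.
    select_times <- function(x, gap_secs = 30 * 60) {
      stopifnot(is.numeric(gap_secs), length(gap_secs) == 1L,
                !is.na(gap_secs), gap_secs >= 0)
      x <- as.numeric(x)                 # POSIXct -> seconds since the epoch
      if (anyNA(x)) stop("x must not contain NA")
      if (is.unsorted(x)) stop("x must be sorted in increasing order")

      n <- length(x)
      res <- logical(n)                  # all FALSE
      if (n == 0L) return(res)

      res[1L] <- TRUE                    # the first row is always kept
      anchor <- x[1L]                    # time of the last kept row
      for (i in seq_len(n)[-1L]) {
        if (x[i] - anchor > gap_secs) {
          anchor <- x[i]
          res[i] <- TRUE
        }
      }
      res
    }

    # Timestamps of id 2, copied from the question's sample data (UTC assumed).
    t2 <- as.POSIXct(c(
      "2016-04-28 10:13:19", "2016-04-28 10:13:44", "2016-04-28 10:36:44",
      "2016-04-28 10:55:31", "2016-04-28 11:00:33", "2016-04-28 11:11:51",
      "2016-04-28 11:14:14", "2016-04-28 11:26:17", "2016-04-28 11:51:02",
      "2016-04-28 11:56:36"), tz = "UTC")

    kept_rows <- which(select_times(t2))
    ```

    With data.table it would be called the same way as the compiled function, e.g. DT[, keep2 := select_times(datetime), by = id], assuming DT is sorted by datetime within each id.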

  • 2020-12-15 21:54

    Here's what I would do:

    setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data
    
    s = 0L
    w = DT[, .I[1L], by=id]$V1
    
    while (length(w)){
       s = s + 1L
       DT[w, tag := s]
    
       m = DT[w, .(id, datetime = datetime+30*60)]
       w = DT[m, which = TRUE, roll=-Inf]
       w = w[!is.na(w)]
    }
    

    which gives

                   datetime          x id  keep tag
     1: 2016-04-28 10:20:18 0.02461368  1  TRUE   1
     2: 2016-04-28 10:41:34 0.88953932  1 FALSE  NA
     3: 2016-04-28 10:46:07 0.31818101  1 FALSE  NA
     4: 2016-04-28 11:00:56 0.14711365  1  TRUE   2
     5: 2016-04-28 11:09:11 0.54406602  1 FALSE  NA
     6: 2016-04-28 11:39:09 0.69280341  1  TRUE   3
     7: 2016-04-28 11:50:01 0.99426978  1 FALSE  NA
     8: 2016-04-28 11:51:46 0.47779597  1 FALSE  NA
     9: 2016-04-28 11:57:58 0.23162579  1 FALSE  NA
    10: 2016-04-28 11:58:23 0.96302423  1 FALSE  NA
    11: 2016-04-28 10:13:19 0.21640794  2  TRUE   1
    12: 2016-04-28 10:13:44 0.70853047  2 FALSE  NA
    13: 2016-04-28 10:36:44 0.75845954  2 FALSE  NA
    14: 2016-04-28 10:55:31 0.64050681  2  TRUE   2
    15: 2016-04-28 11:00:33 0.90229905  2 FALSE  NA
    16: 2016-04-28 11:11:51 0.28915974  2 FALSE  NA
    17: 2016-04-28 11:14:14 0.79546742  2 FALSE  NA
    18: 2016-04-28 11:26:17 0.69070528  2  TRUE   3
    19: 2016-04-28 11:51:02 0.59414202  2 FALSE  NA
    20: 2016-04-28 11:56:36 0.65570580  2  TRUE   4
    

    The idea behind it is described by the OP in a comment:

    per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on
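
    The rule in that comment can be traced by hand. The base R sketch below is only an illustration, using the id 1 timestamps from the output above; the helper mins_since and the UTC time zone are assumptions, and "at least 30 minutes" is read as >= 30.

    ```r
    # Timestamps for id 1, copied from the printed output above.
    # The time zone is an assumption; only the differences matter here.
    times <- as.POSIXct(c(
      "2016-04-28 10:20:18", "2016-04-28 10:41:34", "2016-04-28 10:46:07",
      "2016-04-28 11:00:56", "2016-04-28 11:09:11", "2016-04-28 11:39:09",
      "2016-04-28 11:50:01", "2016-04-28 11:51:46", "2016-04-28 11:57:58",
      "2016-04-28 11:58:23"), tz = "UTC")

    # Hypothetical helper: minutes elapsed since a given anchor row.
    mins_since <- function(anchor) {
      as.numeric(difftime(times, times[anchor], units = "mins"))
    }

    first_kept  <- 1L                                        # row 1 is always kept
    second_kept <- which(mins_since(first_kept)  >= 30)[1L]  # first row 30+ min after row 1
    third_kept  <- which(mins_since(second_kept) >= 30)[1L]  # first row 30+ min after that one
    fourth_kept <- which(mins_since(third_kept)  >= 30)[1L]  # NA when nothing is left

    kept <- c(first_kept, second_kept, third_kept, fourth_kept)
    ```

    Here kept comes out as 1, 4, 6, NA, which lines up with tags 1 to 3 for id 1 in the table above.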

  • 2020-12-15 22:06
    # create an index column
    DT[, idx := 1:.N, by = id]
    
    # find the indices of the matching future dates
    DT[, fut.idx := DT[.(id = id, datetime = datetime+30*60), on = c('id', 'datetime')
                        , idx, roll = -Inf]]
    #               datetime          x id  keep         difftime idx  fut.idx
    # 1: 2016-04-28 09:20:18 0.02461368  1  TRUE   0.0000000 mins   1        4
    # 2: 2016-04-28 09:41:34 0.88953932  1 FALSE  21.2666667 mins   2        6
    # 3: 2016-04-28 09:46:07 0.31818101  1 FALSE  25.8166667 mins   3        6
    # 4: 2016-04-28 10:00:56 0.14711365  1  TRUE  40.6333333 mins   4        6
    # 5: 2016-04-28 10:09:11 0.54406602  1 FALSE  48.8833333 mins   5        7
    # 6: 2016-04-28 10:39:09 0.69280341  1  TRUE  78.8500000 mins   6       NA
    # 7: 2016-04-28 10:50:01 0.99426978  1 FALSE  89.7166667 mins   7       NA
    # 8: 2016-04-28 10:51:46 0.47779597  1 FALSE  91.4666667 mins   8       NA
    # 9: 2016-04-28 10:57:58 0.23162579  1 FALSE  97.6666667 mins   9       NA
    #10: 2016-04-28 10:58:23 0.96302423  1 FALSE  98.0833333 mins  10       NA
    #11: 2016-04-28 09:13:19 0.21640794  2  TRUE   0.0000000 mins   1        4
    #12: 2016-04-28 09:13:44 0.70853047  2 FALSE   0.4166667 mins   2        4
    #13: 2016-04-28 09:36:44 0.75845954  2 FALSE  23.4166667 mins   3        6
    #14: 2016-04-28 09:55:31 0.64050681  2  TRUE  42.2000000 mins   4        8
    #15: 2016-04-28 10:00:33 0.90229905  2 FALSE  47.2333333 mins   5        9
    #16: 2016-04-28 10:11:51 0.28915974  2 FALSE  58.5333333 mins   6        9
    #17: 2016-04-28 10:14:14 0.79546742  2 FALSE  60.9166667 mins   7        9
    #18: 2016-04-28 10:26:17 0.69070528  2  TRUE  72.9666667 mins   8       10
    #19: 2016-04-28 10:51:02 0.59414202  2 FALSE  97.7166667 mins   9       NA
    #20: 2016-04-28 10:56:36 0.65570580  2  TRUE 103.2833333 mins  10       NA
    
    
    # at this point the problem is "solved", but you still have to extract the solution
    # and that's the more complicated part
    DT[, keep.new := FALSE]
    
    # iterate over the matching indices (jumping straight to the correct one)
    DT[, {
           next.idx = 1
    
           while(!is.na(next.idx)) {
             set(DT, .I[next.idx], 'keep.new', TRUE)
             next.idx = fut.idx[next.idx]
           }
         }, by = id]
    
    DT[, identical(keep, keep.new)]
    #[1] TRUE
    

    Alternatively, for the last step you can do the following (this iterates over the entire vector instead of jumping, and I don't know what the speed impact would be):

    DT[, keep.3 := FALSE]
    DT[DT[, .I[na.omit(Reduce(function(x, y) fut.idx[x], c(1, fut.idx), accumulate = TRUE))]
          , by = id]$V1
       , keep.3 := TRUE]
    
    DT[, identical(keep, keep.3)]
    #[1] TRUE
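
    The jump-pointer idea can also be tried without a join. The base R sketch below is a hypothetical illustration for a single id: it uses findInterval() in place of the rolling join to build the same kind of fut.idx vector, then follows the chain from row 1. The names next_index and follow_chain are made up here, and the timestamps are those of id 1 (UTC assumed).

    ```r
    # Timestamps for id 1 as seconds since the epoch (time zone assumed).
    times <- as.numeric(as.POSIXct(c(
      "2016-04-28 10:20:18", "2016-04-28 10:41:34", "2016-04-28 10:46:07",
      "2016-04-28 11:00:56", "2016-04-28 11:09:11", "2016-04-28 11:39:09",
      "2016-04-28 11:50:01", "2016-04-28 11:51:46", "2016-04-28 11:57:58",
      "2016-04-28 11:58:23"), tz = "UTC"))

    # For each row, the index of the first row at least `gap` seconds later,
    # or NA if there is none. Assumes `x` is sorted in increasing order.
    next_index <- function(x, gap = 30 * 60) {
      stopifnot(gap > 0)                 # gap > 0 guarantees every jump moves forward
      # With left.open = TRUE, findInterval() counts the elements of x that are
      # strictly below x + gap; the next position is the first one at or after it.
      nxt <- findInterval(x + gap, x, left.open = TRUE) + 1L
      nxt[nxt > length(x)] <- NA_integer_
      nxt
    }

    # Start at row 1 and jump from kept row to kept row until the chain ends.
    follow_chain <- function(nxt) {
      keep <- logical(length(nxt))
      i <- if (length(nxt) > 0L) 1L else NA_integer_
      while (!is.na(i)) {
        keep[i] <- TRUE
        i <- nxt[i]
      }
      keep
    }

    fut_idx  <- next_index(times)
    keep_new <- follow_chain(fut_idx)
    ```

    For this input fut_idx matches the fut.idx column shown above for id 1 (4, 6, 6, 6, 7, then NA), and the kept rows are 1, 4 and 6. Only the kept rows are visited in the second step, which is the point of precomputing the jumps.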
    