Subset observations that differ by at least 30 minutes in time

忘掉有多难 2020-12-15 21:16

I have a data.table (~30 million rows) consisting of a datetime column in POSIXct format, an id column and a few other co

3 Answers
  •  被撕碎了的回忆
    2020-12-15 21:54

    Here's what I would do:

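    # Sort by id, then datetime (setDT converts in place and sets the key)
    setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data
    
    s = 0L                       # current tag number
    w = DT[, .I[1L], by=id]$V1   # row index of the first row of each id
    
    while (length(w)){
       s = s + 1L
       DT[w, tag := s]           # tag the rows kept in this round
    
       # for each kept row, the earliest time the next kept row may have
       m = DT[w, .(id, datetime = datetime+30*60)]
       # rolling join (roll=-Inf: next observation carried backward) returns,
       # per id, the index of the first row at or after that time, or NA
       w = DT[m, which = TRUE, roll=-Inf]
       w = w[!is.na(w)]          # drop ids with no later row far enough away
    }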
    

    which gives

                   datetime          x id  keep tag
     1: 2016-04-28 10:20:18 0.02461368  1  TRUE   1
     2: 2016-04-28 10:41:34 0.88953932  1 FALSE  NA
     3: 2016-04-28 10:46:07 0.31818101  1 FALSE  NA
     4: 2016-04-28 11:00:56 0.14711365  1  TRUE   2
     5: 2016-04-28 11:09:11 0.54406602  1 FALSE  NA
     6: 2016-04-28 11:39:09 0.69280341  1  TRUE   3
     7: 2016-04-28 11:50:01 0.99426978  1 FALSE  NA
     8: 2016-04-28 11:51:46 0.47779597  1 FALSE  NA
     9: 2016-04-28 11:57:58 0.23162579  1 FALSE  NA
    10: 2016-04-28 11:58:23 0.96302423  1 FALSE  NA
    11: 2016-04-28 10:13:19 0.21640794  2  TRUE   1
    12: 2016-04-28 10:13:44 0.70853047  2 FALSE  NA
    13: 2016-04-28 10:36:44 0.75845954  2 FALSE  NA
    14: 2016-04-28 10:55:31 0.64050681  2  TRUE   2
    15: 2016-04-28 11:00:33 0.90229905  2 FALSE  NA
    16: 2016-04-28 11:11:51 0.28915974  2 FALSE  NA
    17: 2016-04-28 11:14:14 0.79546742  2 FALSE  NA
    18: 2016-04-28 11:26:17 0.69070528  2  TRUE   3
    19: 2016-04-28 11:51:02 0.59414202  2 FALSE  NA
    20: 2016-04-28 11:56:36 0.65570580  2  TRUE   4
    

    The idea behind it is described by the OP in a comment:

    Per id, the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then compute the time differences between row 4 and rows 5:n, keep the first that differs by more than 30 mins, and so on.
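
To make the rule concrete, here is a minimal base R sketch of the same greedy selection for a single id. The helper name `keep_spaced` and its `gap_secs` argument are hypothetical and not part of the answer. It assumes the timestamps are already sorted and treats "at least 30 minutes" as `>=`, which matches the rolling join's behaviour.

```r
# Hypothetical helper (not from the answer): greedy selection for one id.
# `times` must be sorted ascending; `gap_secs` is the minimum spacing in seconds.
keep_spaced <- function(times, gap_secs = 30 * 60) {
  secs <- as.numeric(times)            # POSIXct -> seconds since the epoch
  keep <- logical(length(secs))
  if (length(secs) == 0L) return(keep)
  keep[1L] <- TRUE                     # the first row is always kept
  last_kept <- secs[1L]
  for (i in seq_along(secs)[-1L]) {
    # keep the first row that is at least gap_secs after the last kept row
    if (secs[i] - last_kept >= gap_secs) {
      keep[i] <- TRUE
      last_kept <- secs[i]
    }
  }
  keep
}

# Timestamps of id 1, copied from the output shown in the answer
t1 <- as.POSIXct(c("2016-04-28 10:20:18", "2016-04-28 10:41:34",
                   "2016-04-28 10:46:07", "2016-04-28 11:00:56",
                   "2016-04-28 11:09:11", "2016-04-28 11:39:09",
                   "2016-04-28 11:50:01", "2016-04-28 11:51:46",
                   "2016-04-28 11:57:58", "2016-04-28 11:58:23"),
                 tz = "UTC")
which(keep_spaced(t1))   # rows 1, 4 and 6, the rows tagged for id 1 above
```

With data.table this could be applied per group as `DT[, keep2 := keep_spaced(datetime), by = id]` (the column name `keep2` is only illustrative). A plain R loop over every row is likely to be slow on ~30 million rows. The answer's `while` loop instead runs once per round of kept rows across all ids, so it is probably the better fit at that scale.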
