Find the duration of overlap between pairs of timestamps using data.table

问题

Similar to this question, I'd like to find the duration of overlap between pairs of timestamps using data.table.

Here's my current code:

library(data.table)

DT <- fread(
  "stage,ID,date1,date2
  1,A,2018-04-17 00:00:00,2018-04-17 01:00:00
  1,B,2018-04-17 00:00:00,2018-04-17 00:20:00
  1,C,2018-04-17 00:15:00,2018-04-17 01:00:00
  2,B,2018-04-17 00:30:00,2018-04-17 01:10:00
  2,D,2018-04-17 00:30:00,2018-04-17 00:50:00",
  sep = ","
)

cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]

breaks <- DT[, {
  tmp <- unique(sort(c(date1, date2)))
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = stage]

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), paste(ID, collapse = "+"),  
    by = .EACHI, allow.cartesian = T] %>% 
  mutate(lengthinseconds = as.numeric(difftime(date2, date1, units = "secs")))

Which returns:

  stage               date1               date2    V1 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B+A             900
2     1 2018-04-17 00:15:00 2018-04-17 00:20:00 B+A+C             300
3     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A+C            2400
4     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D+B            1200
5     2 2018-04-17 00:50:00 2018-04-17 01:10:00     B            1200

But I'd like to return only overlaps between user dyads (i.e. no more than two overlapping users). There are several hacky ways I can think of achieve this, such as:

library(dplyr)
library(tidyr)

result %>% 
  filter(nchar(V1)==3) %>% 
  tidyr::separate(V1, c("ID1", "ID2"))

Which returns:

  stage               date1               date2 ID1 ID2 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B   A             900
2     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C            2400
3     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D   B            1200

But this seems inelegant, especially when dealing with longer ID strings and potentially hundreds of IDs per overlap.

Ideally, I'd like to know if there's a way to modify the original data.table code to return this directly.

回答1:

At first glance (and neglecting performance considerations), this requires only a minor modification to OP's code:

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), 
             if (.N == 2L) paste(ID, collapse = "+"),  
             by = .EACHI, allow.cartesian = TRUE]
result

   stage               date1               date2  V1
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00 B+A
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00 A+C
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00 D+B

Only for those groups, i.e., time ranges, where exactly two users are active a result row will be created.

The OP has requested to show the two IDs in separate columns plus to show the duration of the overlap. In addition, I suggest to have the IDs sorted.

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), 
   if (.N == 2L) {
     tmp <- sort(ID)
     .(ID1 = tmp[1], ID2 = tmp[2], dur.in.sec = difftime(end, start, units = "secs"))
     },  
   by = .EACHI, allow.cartesian = TRUE]
result

   stage               date1               date2 ID1 ID2 dur.in.sec
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   A   B   900 secs
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C  2400 secs
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   B   D  1200 secs

回答2:

Another possibility:

DT[breaks, on = .(stage, date1 <= start, date2 >= end)
   ][, if (uniqueN(ID) == 2) .SD, by = .(stage, date1, date2)
     ][, dcast(.SD, stage + date1 + date2 ~ rowid(date1, prefix = 'ID'), value.var = 'ID')
       ][, lengthinseconds := as.numeric(difftime(date2, date1, units = "secs"))][]

which gives:

   stage               date1               date2 ID1 ID2 lengthinseconds
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B   A             900
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C            2400
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D   B            1200

来源：https://stackoverflow.com/questions/49880546/find-the-duration-of-overlap-between-pairs-of-timestamps-using-data-table

标签

datetime

data.table

overlap