How to flatten / merge overlapping time periods

☆樱花仙子☆ 提交于 2019-11-26 16:30:36

Here's a possible solution. The basic idea here is to compare lagged start date with the maximum end date "until now" using the cummax function and create an index that will separate the data into groups

data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = last(end))

# Source: local data frame [3 x 4]
# Groups: ID
# 
#   ID indx      start        end
# 1  A    0 2013-01-01 2013-01-06
# 2  A    1 2013-01-07 2013-01-11
# 3  A    2 2013-01-12 2013-01-15
zack

@David Arenburg's answer is great - but I ran into an issue where an earlier interval ended after a later interval - but using last in the summarise call resulted in the wrong end date. I'd suggest changing first(start) and last(end) to min(start) and max(end)

data %>%
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end))

Also, as @Jonno Bourne mentioned, sorting by start and any grouping variables is important before applying the method.

For the sake of completeness, the IRanges package on Bioconductor has some neat functions which can be used to deal with date or date time ranges. One of it is the reduce() function which merges overlapping or adjacent ranges.

However, there is a drawback because IRanges works on integer ranges (hence the name), so the convenience of using IRanges functions comes at the expense of converting Date or POSIXct objects to and fro.

Also, it seems that dplyr doesn't play well with IRanges (at least judged by my limited experience with dplyr) so I use data.table:

library(data.table)
options(datatable.print.class = TRUE)
library(IRanges)
library(lubridate)

setDT(data)[, {
  ir <- reduce(IRanges(as.numeric(start), as.numeric(end)))
  .(start = as_datetime(start(ir)), end = as_datetime(end(ir)))
}, by = ID]
       ID      start        end
   <fctr>     <POSc>     <POSc>
1:      A 2013-01-01 2013-01-06
2:      A 2013-01-07 2013-01-11
3:      A 2013-01-12 2013-01-15

A code variant is

setDT(data)[, as.data.table(reduce(IRanges(as.numeric(start), as.numeric(end))))[
  , lapply(.SD, as_datetime), .SDcols = -"width"], 
  by = ID]

In both variants the as_datetime() from the lubridate packages is used which spares to specify the origin when converting numbers to POSIXct objects.

Would be interesting to see a benchmark comparision of the IRanges approaches vs David's answer.

It looks like I'm a little late to the party, but I took @zach's code and re-wrote it using data.table below. I didn't do comprehensive testing, but this seemed to run about 20% faster than the tidy version. (I couldn't test the IRange method because the package is not yet available for R 3.5.1)

Also, fwiw, the accepted answer doesn't capture the edge case in which one date range is totally within another (e.g., 2018-07-07 to 2017-07-14 is within 2018-05-01 to 2018-12-01). @zach's answer does capture that edge case.

library(data.table)

start_col = c("2018-01-01","2018-03-01","2018-03-10","2018-03-20","2018-04-10","2018-05-01","2018-05-05","2018-05-10","2018-07-07")
end_col = c("2018-01-21","2018-03-21","2018-03-31","2018-04-09","2018-04-30","2018-05-21","2018-05-26","2018-05-30","2018-07-14")

# create fake data, double it, add ID
# change row 17, such that each ID grouping is a little different
# also adds an edge case in which one date range is totally within another
# (this is the edge case not currently captured by the accepted answer)
d <- data.table(start_col = as.Date(start_col), end_col = as.Date(end_col))
d2<- rbind(d,d)
d2[1:(.N/2), ID := 1]
d2[(.N/2 +1):.N, ID := 2]
d2[17,end_col := as.Date('2018-12-01')]

# set keys (also orders)
setkey(d2, ID, start_col, end_col)

# get rid of overlapping transactions and do the date math
squished <- d2[,.(START_DT = start_col, 
                  END_DT = end_col, 
                  indx = c(0, cumsum(as.numeric(lead(start_col)) > cummax(as.numeric(end_col)))[-.N])),
               keyby=ID
               ][,.(start=min(START_DT), 
                    end = max(END_DT)),
                 by=c("ID","indx")
                 ]
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!