Find all date ranges for overlapping start and end dates in R

后端 未结 3 584
再見小時候
再見小時候 2020-12-06 08:35

I have a data frame that looks like this:

w<-read.table(header=TRUE,text=\"
start.date   end.date
2006-06-26 2006-07-24
2006-07-19 2006-08-16
2007-06-09 2         


        
相关标签:
3条回答
  • 2020-12-06 08:57

    The IRanges package on Bioconductor includes the function reduce which can be utilized to combine overlapping start and end dates into one date range.

    IRanges works on integer ranges so you have to convert the data from class Date to integerand back. This can be wrapped up in a function:

    collapse_date_ranges <- function(w, min.gapwidth = 1L) {
      library(data.table)
      library(magrittr)
      IRanges::IRanges(start = as.integer(as.Date(w$start.date)), 
                       end = as.integer(as.Date(w$end.date))) %>% 
        IRanges::reduce(min.gapwidth = min.gapwidth) %>% 
        as.data.table() %>% 
        .[, lapply(.SD, lubridate::as_date),
          .SDcols = c("start", "end")]
    }
    
    collapse_date_ranges(w, 0L)
    #        start        end
    #1: 2006-06-26 2006-08-16
    #2: 2007-06-09 2007-07-31
    #3: 2007-08-04 2007-09-04
    #4: 2007-09-05 2007-10-12
    #5: 2007-10-19 2007-11-16
    #6: 2007-11-17 2007-12-15
    #7: 2008-06-18 2008-08-20
    
    collapse_date_ranges(x, 0L)
    #        start        end
    #1: 2006-01-19 2006-01-20
    #2: 2006-01-25 2006-01-29
    #3: 2006-02-24 2006-02-25
    #4: 2006-03-15 2006-03-22
    #5: 2006-04-29 2006-04-30
    #6: 2006-05-24 2006-05-25
    #7: 2006-06-26 2006-08-16
    #8: 2006-08-18 2006-08-19
    #9: 2006-08-28 2006-09-02
    

    Explanation

    • In order to avoid name clashes, I prefer the double colon operators :: to access single functions from the IRanges package over using library(IRanges) which loads the whole package.
    • The start and end dates are converted to integer (as.Date is just to ensure the proper class) and create an IRanges object.
    • reduce does all the hard work. The parameter min.gapwidth is required here as reduce collapses adjacent ranges by default (see below).
    • Finally, the result is converted back from integer to date. (You may use dplyr instead of data.table as well.)
    • The solution works for both sample data sets w and x. x includes a special case where one date range embeds other date ranges to full extent.

    Appendix: Collapsing adjacent date ranges

    The sample result given by the OP shows that adjacent data ranges should not be collapsed, e.g., the range 2007-10-19 to 2007-11-16 is separate from the range 2007-11-17 to 2007-12-15 although the second range starts only one day after the first one has ended.

    Just in case, adjacent date ranges are to be collapsed this can be achieved by using the default value of the min.gapwidth parameter:

    collapse_date_ranges(w)
    #        start        end
    #1: 2006-06-26 2006-08-16
    #2: 2007-06-09 2007-07-31
    #3: 2007-08-04 2007-10-12
    #4: 2007-10-19 2007-12-15
    #5: 2008-06-18 2008-08-20
    
    0 讨论(0)
  • 2020-12-06 09:03

    Solution.

    w<-read.table(header=TRUE, stringsAsFactor=F, text="
    start.date   end.date
    2006-06-26 2006-07-24
    2006-07-19 2006-08-16
    2007-06-09 2007-07-07
    2007-06-24 2007-07-22
    2007-07-03 2007-07-31
    2007-08-04 2007-09-01
    2007-08-07 2007-09-04
    2007-09-05 2007-10-03
    2007-09-14 2007-10-12
    2007-10-19 2007-11-16
    2007-11-17 2007-12-15
    2008-06-18 2008-07-16
    2008-06-28 2008-07-26
    2008-07-11 2008-08-08
    2008-07-23 2008-08-20")
    
    w <- data.frame(lapply(w, as.Date))
    
    library(lubridate)
    
    idx.rle <- rle(as.numeric(sapply(1:(nrow(w)-1), function(i) int_overlaps(interval(w[i,1],w[i,2]), interval(w[i+1,1],w[i+1,2])))))
    
    
    
    
    i.starts <- nrow(w)-rev(cumsum(rev(idx.rle$length)))
    i.ends <-  1+cumsum(idx.rle$length)
    
     do.call(rbind,
        lapply(1:length(idx.rle$lengths),
               function(i) {
                   i.start <- i.starts[i]
                   i.end <-  i.ends[i]
                   if(idx.rle$values[i]==1) {
                       d <- data.frame(start.date=w[i.start,1],
                                       end.date=max(w[i.start:i.end,2]) );
                       names(d) <- names(w);
                       d
                   } else {
                       if(idx.rle$lengths[i]>1&i>1&i<length(idx.rle$lengths)) {
                           data.frame(w[(i.start+1):(i.end-1),] )
                       } else {
                           if (idx.rle$lengths[i]>=1&i==1) {
                               data.frame(w[(i.start):(i.end-1),])
                           } else {
                               if(idx.rle$lengths[i]>=1&i==length(idx.rle$lengths)) data.frame(w[(i.start+1):(i.end),] ) 
                           }
                       }
                   }
               }))
    
    0 讨论(0)
  • 2020-12-06 09:05

    Try this:

    w[] <- lapply(w, function(x) as.Date(x, '%Y-%m-%d'))
    w <- w[order(w$start.date),] # sort the data by start dates if already not sorted
    w$group <- 1:nrow(w) # common intervals should belong to same group
    merge.indices <- lapply(2:nrow(w), function(x) {
                        indices <- which(findInterval(w$end.date[1:(x-1)], w$start.date[x])==1)
                        if (length(indices) > 0) indices <- c(indices, x) 
                        indices})
    # assign the intervals the right groups
    for (i in 1:length(merge.indices)) {
      if (length(merge.indices[[i]]) > 0) {
        w$group[merge.indices[[i]]] <- min(w$group[merge.indices[[i]]])
      }
    }
    
    do.call(rbind, lapply(split(w, w$group), function(x) data.frame(start.date=min(x[,1]), end.date=max(x[,2]))))
    

    It conceptually merges overlapping intervals into the same group as shown below:

    with output:

       start.date   end.date
    1  2006-01-19 2006-01-20
    2  2006-01-25 2006-01-29
    3  2006-02-24 2006-02-25
    4  2006-03-15 2006-03-22
    5  2006-04-29 2006-04-30
    6  2006-05-24 2006-05-25
    7  2006-06-26 2006-08-16
    11 2006-08-18 2006-08-19
    12 2006-08-28 2006-09-02
    
    0 讨论(0)
提交回复
热议问题