Find overlapping dates for each ID and create a new row for the overlap

后端 未结 2 1074
北荒
北荒 2020-12-09 13:05

I would like to find the overlapping dates for each ID and create a new row with the overlapping dates and also combine the characters (char) for the lines. It is possible t

相关标签:
2条回答
  • 2020-12-09 13:25

    Introduction

    The for-loop you added to your question and the included comparison were a good start. The should be some additional brackets ( and ) in the date comparison. This for-loop-approach automatically considers new rows in the data frame. Therefore, you can get three-, four- and more-character strings in the char column.

    Create input data

    df = as.data.frame(list('ID'=c(15, 15, 16, 17, 17, 17, 17, 17, 17, 17),
                            'date1'=as.Date(c('2003-04-05', '2003-04-20', '2001-01-02', '2003-03-05', '2005-04-15', '2007-05-15', '2008-02-05', '2010-06-07', '2010-09-22', '2012-02-28')),
                            'date2'=as.Date(c('2003-05-06', '2003-06-20', '2002-03-04', '2007-02-22', '2014-05-19', '2008-02-05', '2012-02-14', '2011-02-14', '2014-05-19', '2013-03-04')),
                            'char'=c('E', 'R', 'M', 'I', 'C', 'I', 'M', 'V', 'P', 'R')),
                       stringsAsFactors=FALSE)
    

    Solution

    Iterate all rows (that were existing in the original data.frame) and compare them to all current lines.

    nrow_init = nrow(df)
    for (i in 1:(nrow(df)-1)) {
      print(i)
      ## get rows of df that have overlapping dates
      ##   (1:nrow(df))>i :: consider only rows below the current row to avoid double processing of two row-pairs
      ##   (!grepl(df$char[i],df$char)) :: prevent double letters
      ## Because we call nrow(df) each time (and not save it as a variable once in the beginning), we consider also new rows here. Therefore, we do not need the specific procedure for comparing 3 or more rows.
      loc = ((1:nrow(df))>i) & (!grepl(df$char[i],df$char)) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date1>df$date1[i]) & (df$date1<df$date2[i])) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)) | ((df$date2<df$date2[i]) & (df$date2>df$date1[i])))
      ## Uncomment this line, if you want to compare only two rows each and not more
      # loc = ((1:nrow(df))<=nrow_init) & ((1:nrow(df))>i) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)))
    
      ## proceed only of at least one duplicate row was found
      if (sum(loc) > 0) {
        # build new rows
        #  pmax and pmin do element-wise min and max calculation; df$date1[i] and df$date2[i] are automatically extended to the length of df$date1[loc] and df$date2[loc], respectively
        df_append = as.data.frame(list('ID'=df$ID[loc],
                                       'date1'=pmax(df$date1[i],df$date1[loc]),
                                       'date2'=pmin(df$date2[i],df$date2[loc]),
                                       'char'=paste0(df$char[i],df$char[loc])))
        ## append new rows
        df = rbind(df, df_append)
      }
    }
    
    ## create a new column and sort the characters in it
    ##  idea for sort: https://stackoverflow.com/a/5904854/4612235
    df$sort_char = df$char
    for (i in 1:nrow(df)) df$sort_char[i] = paste(sort(unlist(strsplit(df$sort_char[i], ""))), collapse = "")
    ## remove duplicates
    df = df[!duplicated(df[c('ID', 'date1', 'date2', 'sort_char')]),]
    ## remove additional column
    df$sort_char = NULL
    

    Out put

    ID      date1      date2 char
    15 2003-04-05 2003-05-06    E
    15 2003-04-20 2003-06-20    R
    16 2001-01-02 2002-03-04    M
    17 2003-03-05 2007-02-22    I
    17 2005-04-15 2014-05-19    C
    17 2007-05-15 2008-02-05    I
    17 2008-02-05 2012-02-14    M
    17 2010-06-07 2011-02-14    V
    17 2010-09-22 2014-05-19    P
    17 2012-02-28 2013-03-04    R
    15 2003-04-20 2003-05-06   ER
    17 2005-04-15 2007-02-22   IC
    17 2007-05-15 2008-02-05   CI
    17 2008-02-05 2012-02-14   CM
    17 2010-06-07 2011-02-14   CV
    17 2010-09-22 2014-05-19   CP
    17 2012-02-28 2013-03-04   CR
    17 2010-06-07 2011-02-14   MV
    17 2010-09-22 2012-02-14   MP
    17 2010-06-07 2011-02-14  MCV
    17 2010-09-22 2012-02-14  MCP
    17 2010-09-22 2011-02-14   VP
    17 2010-09-22 2011-02-14  VCP
    17 2010-09-22 2011-02-14  VMP
    17 2010-09-22 2011-02-14 VMCP
    17 2012-02-28 2013-03-04   PR
    17 2012-02-28 2013-03-04  PCR
    
    0 讨论(0)
  • 2020-12-09 13:40

    First, we create a data.table of all possible intervals for each ID.

    All possible intervals means that we take all start and end dates of an ID and combine them in a sorted vector tmp. The unique values indicate all possible intersections (or breaks) of all given intervals of the ID on the time axis. For later joining, the breaks are re-arranged in one interval per row with a start and an end column:

    library(data.table)
    options(datatable.print.class = TRUE)
    breaks <- DT[, {
      tmp <- unique(sort(c(date1, date2)))
      .(start = head(tmp, -1L), end = tail(tmp, -1L))
      }, by = ID]
    breaks
    
           ID      start        end
        <int>     <IDat>     <IDat>
     1:    15 2003-04-05 2003-04-20
     2:    15 2003-04-20 2003-05-06
     3:    15 2003-05-06 2003-06-20
     4:    16 2001-01-02 2002-03-04
     5:    17 2003-03-05 2005-04-15
     6:    17 2005-04-15 2007-02-22
     7:    17 2007-02-22 2007-05-15
     8:    17 2007-05-15 2008-02-05
     9:    17 2008-02-05 2010-06-07
    10:    17 2010-06-07 2010-09-22
    11:    17 2010-09-22 2011-02-14
    12:    17 2011-02-14 2012-02-14
    13:    17 2012-02-14 2012-02-28
    14:    17 2012-02-28 2013-03-04
    15:    17 2013-03-04 2014-05-19
    

    Then, a non-equi join is performed whereby the values are aggregated simultaneously on the join conditions (by = .EACHI is called grouping by each i, see this answer for a more detailed explanation):

    DT[breaks, on = .(ID, date1 <= start, date2 >= end), paste(char, collapse = ""),  
       by = .EACHI, allow.cartesian = TRUE]
    
           ID      date1      date2     V1
        <int>     <IDat>     <IDat> <char>
     1:    15 2003-04-05 2003-04-20      E
     2:    15 2003-04-20 2003-05-06     ER
     3:    15 2003-05-06 2003-06-20      R
     4:    16 2001-01-02 2002-03-04      M
     5:    17 2003-03-05 2005-04-15      I
     6:    17 2005-04-15 2007-02-22     IC
     7:    17 2007-02-22 2007-05-15      C
     8:    17 2007-05-15 2008-02-05     CI
     9:    17 2008-02-05 2010-06-07     CM
    10:    17 2010-06-07 2010-09-22    CMV
    11:    17 2010-09-22 2011-02-14   CMVP
    12:    17 2011-02-14 2012-02-14    CMP
    13:    17 2012-02-14 2012-02-28     CP
    14:    17 2012-02-28 2013-03-04    CPR
    15:    17 2013-03-04 2014-05-19     CP
    

    The result differs from the expected result posted by the OP but plotting the data convinces that the above result shows all possible overlaps:

    library(ggplot2)
    ggplot(DT) + aes(y = char, yend = char, x = date1, xend = date2) + 
      geom_segment() + facet_wrap("ID", ncol = 1L) 
    

    Data

    library(data.table)
    DT <- fread(
      "ID    date1         date2       char
    15  2003-04-05  2003-05-06      E
    15  2003-04-20  2003-06-20      R
    16  2001-01-02  2002-03-04      M
    17  2003-03-05  2007-02-22      I   
    17  2005-04-15  2014-05-19      C
    17  2007-05-15  2008-02-05      I
    17  2008-02-05  2012-02-14      M
    17  2010-06-07  2011-02-14      V
    17  2010-09-22  2014-05-19      P
    17  2012-02-28  2013-03-04      R"
    )
    cols <- c("date1", "date2")
    DT[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
    
    0 讨论(0)
提交回复
热议问题