Rbind with new columns and data.table

后端 未结 5 694
借酒劲吻你
借酒劲吻你 2020-12-30 07:38

I need to add many large tables to an existing table, so I use rbind with the excellent package data.table. But some of the later tables have more columns than the original

相关标签:
5条回答
  • 2020-12-30 07:47

    the basic concept is to add missing columns in both directions: from the running master table to the newTable and back the other way.

    As @menl pointed out in the comments, simply assigning an NA is a problem, because that will make the whole column of class logical.

    One solution is to force all columns of a single type (ie as.numeric(NA)), but that is too restrictive.

    Instead, we need to analyze each new column for its class. We can then use as(NA, cc) _(cc being the class) as the vector that we will assign to a new column. We wrap this in an lapply statement on the RHS and use eval(columnName) on the LHS to assign.

    We can then wrap this in a function and use S3 methods so that we can simply call

    rbindFill(A, B)
    

    Below is the function.

    rbindFill.data.table <- function(master, newTable)  {
    # Append newTable to master
    
        # assign to Master
        #-----------------#
          # identify columns missing
          colMisng     <- setdiff(names(newTable), names(master))
    
          # if there are no columns missing, move on to next part
          if (!identical(colMisng, character(0)))  {
               # identify class of each
                colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))
    
                # assign to each column value of NA with appropriate class 
                master[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
              }
    
        # assign to newTable
        #-----------------#
          # identify columns missing
          colMisng     <- setdiff(names(master), names(newTable))
    
          # if there are no columns missing, move on to next part
          if (!identical(colMisng, character(0)))  {
            # identify class of each
            colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))
    
            # assign to each column value of NA with appropriate class 
            newTable[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
          }
    
        # reorder columns to avoid warning about ordering
        #-----------------#
          colOrdering <- colOrderingByOtherCol(newTable, names(master))
          setcolorder(newTable,  colOrdering)
    
        # rbind them! 
        #-----------------#
          rbind(master, newTable)
      }
    
      # implement generic function
      rbindFill <- function(x, y, ...) UseMethod("rbindFill")
    


    Example Usage:

        # Sample Data: 
        #--------------------------------------------------#
        A  <- data.table(a=1:3, b=1:3, c=1:3)
        A2 <- data.table(a=6:9, b=6:9, c=6:9)
        B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
        C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
        #--------------------------------------------------#
    
        # Four iterations of calling rbindFill
        master <- rbindFill(A, B)
        master <- rbindFill(master, A2)
        master <- rbindFill(master, C)
    
        # Results:
        master
        #      a  b c  d  m     n     f
        #  1:  1  1 1 NA NA    NA    NA
        #  2:  2  2 2 NA NA    NA    NA
        #  3:  3  3 3 NA NA    NA    NA
        #  4: NA  1 1  1  A    NA    NA
        #  5: NA  2 2  2  B    NA    NA
        #  6: NA  3 3  3  C    NA    NA
        #  7:  6  6 6 NA NA    NA    NA
        #  8:  7  7 7 NA NA    NA    NA
        #  9:  8  8 8 NA NA    NA    NA
        # 10:  9  9 9 NA NA    NA    NA
        # 11: NA NA 7 NA NA  0.86  TRUE
        # 12: NA NA 8 NA NA -1.15 FALSE
        # 13: NA NA 9 NA NA  1.10  TRUE
    
    0 讨论(0)
  • 2020-12-30 08:00

    Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0] which has the structure of the second data.table. This saves the possibility to introduce bugs in user functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.

    Also, using rbindlist() should be much faster when dealing with data.tables.

    Define the tables (same as mnel's code above):

    library(data.table)
    A  <- data.table(a=1:3, b=1:3, c=1:3)
    A2 <- data.table(a=6:9, b=6:9, c=6:9)
    B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
    C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
    

    Insert the missing variables in table A: (note the use of A2[0]

    A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)
    

    Insert the missing columns in table A2:

    A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)
    

    Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (possibly not needed, not sure if rbindlist() binds across column names or column positions):

    setcolorder(A2, names(A))
    DT.ALL <- rbindlist(l=list(A,A2))
    DT.ALL
    

    Repeat for the other tables... Maybe it would be better to put this into a function rather than repeat by hand...

    DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
    B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
    setcolorder(B, names(DT.ALL))
    DT.ALL <- rbindlist(l=list(DT.ALL, B))
    
    DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
    C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
    setcolorder(C, names(DT.ALL))
    DT.ALL <- rbindlist(l=list(DT.ALL, C))
    DT.ALL
    

    The result looks the same as mnels' output (except for the random numbers and the column order).

    PS1: The original author does not say what to do if there are matching variables -- do we really want to do a rbind() or are we thinking of a merge()?

    PS2: (Since I do not have enough reputation to comment) The gist of the question seems a duplicate of this question. Also important for the benchmarking of data.table vs. plyr with large datasets.

    0 讨论(0)
  • 2020-12-30 08:06

    Here is an approach that will update the missing columns in

    rbind.missing <- function(A, B) { 
    
      cols.A <- names(A)
      cols.B <- names(B)
    
      missing.A <- setdiff(cols.B,cols.A)
      # check and define missing columns in A
      if(length(missing.A) > 0L){
       # .. means "look up one level"
       class.missing.A <- lapply(B[, ..missing.A], class)
       nas.A <- lapply(class.missing.A, as, object = NA)
       A[,c(missing.A) := nas.A]
      }
      # check and define missing columns in B
      missing.B <- setdiff(names(A), cols.B)
      if(length(missing.B) > 0L){
        class.missing.B <- lapply(A[, ..missing.B], class)
        nas.B <- lapply(class.missing.B, as, object = NA)
        B[,c(missing.B) := nas.B]
      }
      # reorder so they are the same
      setcolorder(B, names(A))
      rbind(A, B)
    
    }
    
    rbind.missing(dt.1,dt.2)
    
    ##    aa bb cc
    ## 1:  1  2 NA
    ## 2:  2  3 NA
    ## 3:  3  4 NA
    ## 4:  1  2  3
    ## 5:  2  3  4
    ## 6:  3  4  5
    

    This will not be efficient for many, or large data.tables, as it only works two at a time.

    0 讨论(0)
  • 2020-12-30 08:08

    Since v1.9.2, data.table's rbind function gained fill argument. From ?rbind.data.table documentation:

    If TRUE fills missing columns with NAs. By default FALSE. When TRUE, use.names has to be TRUE, and all items of the input list has to have non-null column names.

    Thus you can do (prior to approx v1.9.6):

    data.table::rbind(dt.1, dt.2, fill=TRUE) 
    #    aa bb cc
    # 1:  1  2 NA
    # 2:  2  3 NA
    # 3:  3  4 NA
    # 4:  1  2  3
    # 5:  2  3  4
    # 6:  3  4  5
    

    UPDATE for v1.9.6:

    This now works directly:

    rbind(dt.1, dt.2, fill=TRUE)
    #    aa bb cc
    # 1:  1  2 NA
    # 2:  2  3 NA
    # 3:  3  4 NA
    # 4:  1  2  3
    # 5:  2  3  4
    # 6:  3  4  5
    
    0 讨论(0)
  • 2020-12-30 08:11

    The answers are awesome, but looks like, there are some functions suggested here such as plyr::rbind.fill and gtools::smartbind which seemed to work perfectly for me.

    0 讨论(0)
提交回复
热议问题