rbindlist two data.tables where one has factor and other has character type for a column

醉话见心 2020-12-09 18:03

I just discovered this warning in my script that was a bit strange.

# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion
         


        
3 Answers
  • 2020-12-09 18:10

    The bug is not fixed in R 4.0.2 and data.table 1.13.0. When I rbindlist() two DTs, one of which has factor columns while the other is empty, the factor column in the result is broken: values are mangled (\n appears at random), the levels are wrong, and NAs are introduced.
    The workaround is to avoid rbindlist()-ing a DT with an empty one and to bind it only with other DTs that also contain data, although this requires some boilerplate code.
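
A minimal sketch of that boilerplate, assuming it is acceptable to simply drop zero-row tables before binding. The helper name `bind_nonempty` is hypothetical and not part of data.table.

```r
library(data.table)

# Hypothetical helper: drop zero-row tables, then bind what is left.
bind_nonempty <- function(tables) {
  keep <- Filter(function(d) nrow(d) > 0L, tables)
  rbindlist(keep)
}

DT.full  <- data.table(x = factor(letters[1:5]), y = 6:10)
DT.empty <- data.table(x = factor(character(0)), y = integer(0))

res <- bind_nonempty(list(DT.full, DT.empty))
```

If every input is empty, `Filter()` returns an empty list, so the caller may want to handle that case separately to keep the column structure.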

  • 2020-12-09 18:11

    rbindlist is superfast because it doesn't do the checking that rbind.fill (plyr) or do.call(rbind.data.frame, ...) does.

    You can use a workaround like this to ensure that factors are coerced to characters.

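    library(data.table)

    DT.1 <- data.table(x = factor(letters[1:5]), y = 6:10)
    DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)

    # Collect the tables to bind in one list
    DDL <- list(DT.1, DT.2)

    for (ii in seq_along(DDL)) {
      # Names of the factor columns in the ii-th table
      ff <- Filter(function(x) is.factor(DDL[[ii]][[x]]), names(DDL[[ii]]))
      for (fn in ff) {
        # Convert each factor column to character by reference
        set(DDL[[ii]], j = fn, value = as.character(DDL[[ii]][[fn]]))
      }
    }
    rbindlist(DDL)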
    

    or (less memory efficiently)

    rbindlist(rapply(DDL, classes = 'factor', f= as.character, how = 'replace'))
    
  • 2020-12-09 18:29

    UPDATE - This bug (#2650) was fixed on 17 May 2013 in v1.8.9


    I believe that rbindlist when applied to factors is combining the numerical values of the factors and using only the levels associated with the first list element.

    As in this bug report: http://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975
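
The suspected mechanism can be imitated in base R. This is only a sketch of the behaviour described above, not data.table's actual implementation; the names `by_codes` and `by_labels` are made up for illustration.

```r
f1 <- factor(c("1st", "2nd"), levels = c("1st", "2nd"))
f2 <- factor(c("2nd", "1st"), levels = c("2nd", "1st"))

# Concatenate the integer codes and reuse only the first factor's levels
codes    <- c(as.integer(f1), as.integer(f2))
by_codes <- factor(levels(f1)[codes], levels = levels(f1))

# Concatenate the labels instead, keeping the union of the levels
by_labels <- factor(c(as.character(f1), as.character(f2)),
                    levels = union(levels(f1), levels(f2)))

as.character(by_codes)   # "1st" "2nd" "1st" "2nd"
as.character(by_labels)  # "1st" "2nd" "2nd" "1st"
```

Combining by codes silently relabels the second factor, while combining by labels preserves the values.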


    # Temporary workaround: 
    
    # unique() guards against duplicated levels, which factor() rejects
    levs <- unique(c(as.character(DT.1$x), as.character(DT.2$x)))
    
    DT.1[, x := factor(x, levels=levs)]
    DT.2[, x := factor(x, levels=levs)]
    
    rbindlist(list(DT.1, DT.2))
    

    As another view of what's going on:

    DT3 <- data.table(x=c("1st", "2nd"), y=1:2)
    DT4 <- copy(DT3)
    
    DT3[, x := factor(x, levels=x)]
    DT4[, x := factor(x, levels=x, labels=rev(x))]
    
    DT3
    DT4
    
    # Have a look at the difference:
    rbindlist(list(DT3, DT4))$x
    # [1] 1st 2nd 1st 2nd
    # Levels: 1st 2nd
    
    do.call(rbind, list(DT3, DT4))$x
    # [1] 1st 2nd 2nd 1st
    # Levels: 1st 2nd
    

    Edit as per comments:

    As for observation 1, what's happening is similar to:

    x <- factor(LETTERS[1:5])
    
    x[6:10] <- letters[1:5]
    x
    
    # Note, however, that it matters whether the assigned value is already a level
    x[11] <- "S"  # warning, since `S` is not one of the levels of x
    x[12] <- "D"  # all good, since `D` *is* one of the levels of x
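
To avoid the warning and the resulting NAs in base R, one can either extend the levels before assigning or combine the values as character and rebuild the factor. A small sketch, with variable names chosen only for illustration:

```r
# Option 1: add the new level first, then assign
x <- factor(LETTERS[1:5])
levels(x) <- c(levels(x), "S")
x[6] <- "S"   # no warning, because "S" is now a level

# Option 2: combine as character and build the factor afterwards
y <- factor(LETTERS[1:5])
combined <- factor(c(as.character(y), letters[1:5]))
```

The second option mirrors the workaround of converting factor columns to character before binding.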
    