R random forest error: type of predictors in new data do not match

挽巷 2020-12-04 14:37

I am trying to use the quantile regression forest function in R (quantregForest), which is built on the randomForest package. I am getting a type mismatch error that I can't quite

8 answers
  •  情书的邮戳
    2020-12-04 15:17

    @mgoldwasser is right in general, but there is also a very nasty bug in predict.randomForest: even if the training set and the prediction set have exactly the same levels, you can still get this error. It happens when a factor has NA embedded as a separate level. The problem is that predict.randomForest essentially does the following:

    # Assume your original factor has two "proper" levels + NA level:
    f <- factor(c(0,1,NA), exclude=NULL)
    
    length(levels(f)) # => 3
    levels(f)         # => "0" "1" NA
    
    # Note that
    sum(is.na(f))     # => 0
    # i.e., the values of the factor are not `NA`; only the corresponding level is.
    
    # Internally predict.randomForest passes the factor (the one of the training set)
    # through the function `factor(.)`.
    # Unfortunately, it does _not_ do this for the prediction set.
    # See what happens to f if we do that:
    pf <- factor(f)
    
    length(levels(pf)) # => 2
    levels(pf)         # => "0" "1"
    
    # In other words:
    length(levels(f)) != length(levels(factor(f))) 
    # => sad but TRUE
    

    So predict.randomForest always discards the NA level from the training set's factor and therefore always sees one additional level in the prediction set.

    A workaround is to rename the NA level to the string "NA" before calling randomForest:

    levels(f)[is.na(levels(f))] <- "NA"
    levels(f) # => "0"  "1"  "NA"
              #              .... note that this is no longer a plain `NA`
    

    Now calling factor(f) won't discard the level, and the check succeeds.
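
    With several factor columns, the same renaming has to be applied to both the training set and the prediction set. Here is a minimal sketch in base R, assuming plain data frames; `relabel_na_levels`, `train`, and `newdata` are hypothetical names made up for illustration and are not part of randomForest or quantregForest:

```r
# Hypothetical helper (not part of randomForest): rename an NA level to the
# string "NA" in every factor column of a data frame.
relabel_na_levels <- function(df, label = "NA") {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.factor(x) && anyNA(levels(x))) {
      # Same trick as above: overwrite the NA level with a real string
      levels(x)[is.na(levels(x))] <- label
      df[[col]] <- x
    }
  }
  df
}

# Made-up example data; both sets contain a factor with an embedded NA level
train   <- data.frame(x = factor(c(0, 1, NA), exclude = NULL),
                      y = c(1.5, 2.5, 3.5))
newdata <- data.frame(x = factor(c(1, NA, 0), exclude = NULL))

train   <- relabel_na_levels(train)
newdata <- relabel_na_levels(newdata)

# factor() no longer drops a level, and both sets agree on the levels
length(levels(train$x)) == length(levels(factor(train$x)))  # => TRUE
identical(levels(train$x), levels(newdata$x))               # => TRUE
```

    The relabelled data frames can then be passed to randomForest and predict as usual. This sketch only addresses the NA-level case described here; it does not fix level sets that genuinely differ between training and prediction data.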
