Error in predicting raster with randomForest, Caret, and factor variables

Question

I am trying to predict a raster layer with randomForest and the caret package, but the prediction fails when I introduce factor variables. Without factors, everything works fine, but as soon as I bring a factor in, I get the error:

Error in predict.randomForest(modelFit, newdata) : Type of predictors in new data do not match that of the training data.

I have created some sample code below that walks through the process. I present it in a few steps for transparency and to provide a working example.

(To skip the set-up code, jump from here on down...)

The first step creates sample data, fits an RF model, and predicts a raster with NO factors involved. Everything works fine.

# simulate data
x1p <- runif(50, 10, 20) # presence
x2p <- runif(50, 100, 200)
x1a <- runif(50, 15, 25) # absence
x2a <- runif(50, 180, 400)
x1 <- c(x1p, x1a)
x2 <- c(x2p,x2a)
y <- c(rep(1,50), rep(0,50)) # presence/absence
d <- data.frame(x1 = x1, x2 = x2, y = y)

# RF Classification on data with no factors... works fine
require(randomForest)
dRF <- d
dRF$y <- factor(ifelse(d$y == 1, "present", "absent"),
                levels = c("present", "absent"))
rfFit <- randomForest(y = dRF$y, x = dRF[,1:2], ntree=100) # RF Classification

# Create sample Rasters
require(raster)
r1 <- r2 <- raster(nrow=100, ncol=100)
values(r1) <- runif(ncell(r1), 5, 25 )
values(r2) <- runif(ncell(r2), 85, 500 )
s <- stack(r1, r2)
names(s) <- c("x1", "x2")

# raster::predict() with no factors, works fine.
model <- predict(s, rfFit, na.rm=TRUE, type="prob", progress='text')
spplot(model)

The next steps create a factor variable to add to the training data and a raster with matching values for the prediction. Note that the raster holds plain integers; it is not an as.factor() raster. Everything still works fine...

# Create factor variable
x3p <- sample(0:5, 50, replace=T)
x3a <- sample(3:7, 50, replace=T)
x3 <- c(x3p, x3a)
dFac <- dRF
dFac$x3 <- as.factor(x3)
dFac <- dFac[,c(1,2,4,3)] # reorder

# RF model with factors, works fine
rfFit2 <- randomForest(y ~ x1 + x2 + x3, data=dFac, ntree=100)

# Create new raster, but not as.factor()
r3 <- raster(nrow=100, ncol=100)
values(r3) <- sample(0:7, ncell(r3), replace=T)
s2 <- stack(s, r3)
names(s2) <- c("x1", "x2", "x3") 
s2 <- brick(s2) # brick or stack, either works

# RF, raster::predict() from fit with factor
f <- levels(dFac$x3) # included, but not necessary
model2 <- predict(s2, rfFit2,  type="prob", 
          progress='text', factors=f, index=1:2)
spplot(model2) # works fine

After the above steps, I now have an RF model that is trained with data including a factor variable and predicted on a raster brick that contains an integer raster of like values. That is my end goal, but I want to be able to do it through the caret package workflow. Below I introduce caret::train() with no factors and all works well.

# RF with Caret and NO factors
require(caret)
rf_ctrl <- trainControl(method = "cv", number=10,
           allowParallel=FALSE, verboseIter=TRUE, 
           savePredictions=TRUE, classProbs=TRUE) 
cFit1 <- train(y = dRF$y, x = dRF[,1:2], method = "rf", 
         tuneLength=4, trControl = rf_ctrl, importance = TRUE)
model3 <- predict(s2, cFit1,  type="prob", 
          progress='text', factors=f, index=1:2) 
spplot(model3) # works with caret and NO factors

(...to here. This is where the issues begin)

Here is where things fail. Training an RF model with a factor variable through caret works, but the model fails at raster::predict().

# RF with Caret and FACTORS
rf_ctrl2 <- trainControl(method = "cv", number=10,
            allowParallel=FALSE, verboseIter=TRUE, 
            savePredictions=TRUE, classProbs=TRUE)
cFit2 <- train(y = dFac$y, x = dFac[,1:3], method = "rf", 
         tuneLength=4, trControl = rf_ctrl2, importance = TRUE)
model4 <- predict(s2, cFit2,  type="prob", 
          progress='text', factors=f, index=1:2) 
# FAIL: "Type of predictors in new data do not match that of the training data."

Next I try the same as above, but instead of using an integer raster that has the same values as the factor levels, I convert the raster to a factor with as.factor() and assign levels. This fails as well.

#trying with raster as.factor()
r3f <- raster(nrow=100, ncol=100)
values(r3f) <- sample(0:7, ncell(r3f), replace=T)
r3f <- as.factor(r3f)
f <- levels(r3f)[[1]]
f$code <- as.character(f[,1])
levels(r3f) <- f
s2f <- stack(s, r3f)
names(s2f) <- c("x1", "x2", "x3")
s2f <- brick(s2f)

model4f <- predict(s2f, cFit2,  type="prob", 
           progress='text', factors=f, index=1:2)
# FAIL "Type of predictors in new data do not match that of the training data."

The error and the progression of steps above clearly suggest that there is an issue with my approach and caret::train() vs. raster::predict(). I have walked through the debugger (to the best of my ability) and addressed the issues I noticed, but there was no smoking gun.
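
One check that can help narrow this kind of error down is to compare the class of each predictor in the training data with the class of the same column in the data handed to predict(). The following is only a minimal base-R sketch, assuming the message points at a predictor whose class differs between the two. The helper find_type_mismatches() and the toy data frames are hypothetical and not part of randomForest, caret, or raster.

```r
# Hypothetical helper (not part of randomForest, caret, or raster):
# list the predictors whose class differs between training data and new data.
find_type_mismatches <- function(train_x, newdata) {
  shared <- intersect(names(train_x), names(newdata))
  first_class <- function(col) class(col)[1]
  train_cls <- vapply(train_x[shared], first_class, character(1))
  new_cls   <- vapply(newdata[shared], first_class, character(1))
  bad <- shared[train_cls != new_cls]
  data.frame(variable = bad,
             training = unname(train_cls[bad]),
             newdata  = unname(new_cls[bad]),
             stringsAsFactors = FALSE)
}

# Toy stand-ins for the training predictors and for extracted raster cells
train_x <- data.frame(x1 = c(12.3, 18.9), x2 = c(150, 320),
                      x3 = factor(c(2, 5)))
cells   <- data.frame(x1 = c(9.7, 21.4), x2 = c(95, 410),
                      x3 = c(2L, 5L))

# x3 is a factor in training but an integer in the new data
mismatches <- find_type_mismatches(train_x, cells)
print(mismatches)
```

With the data from this question, the training frame would be dFac[,1:3]; the new data would have to be a sample of cell values taken from the raster stack.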

Any and all help would be greatly appreciated. Thanks!

Added: While continuing to experiment, I realized that it works if the model in caret::train() is specified in formula form. The structure of the model object shows that contrasts are created for the factor variable. I suppose this also means that raster::predict() recognizes the contrasts. This is good, but a bummer because my methods are not set up to use formula-based predictions. Any additional help is still appreciated.

# RF with Caret WITH FACTORS, using the model formula interface
rf_ctrl3 <- trainControl(method = "cv", number=10,
            allowParallel=FALSE, verboseIter=TRUE,
            savePredictions=TRUE, classProbs=TRUE)
cFit3 <- train(y ~ x1 + x2 + x3, data=dFac, method = "rf", 
            tuneLength=4, trControl = rf_ctrl3, importance = TRUE)

model5 <- predict(s2, cFit3,  type="prob", progress='text') # prediction raster
spplot(model5)

Answer 1:

It took a good bit of testing, but the answer is that, for caret::train() models that contain factors, raster::predict() only works if the model is specified as a formula (y ~ x1 + x2 + x3) and not as y = y, x = x (with x as a matrix or data.frame). Only through the formula interface will the model create the proper contrasts or dummy variables. There is no need to convert your raster layers to factors via as.factor(); the predict function will do that for you.
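
To make "contrasts or dummy variables" concrete, here is a minimal base-R sketch of the expansion a formula triggers. It uses stats::model.matrix() on a toy data frame. That caret's formula interface performs an equivalent expansion internally is my understanding rather than something shown in the code above, and the values in toy are illustrative only.

```r
# Toy data frame with one numeric and one factor predictor
toy <- data.frame(x1 = c(10.5, 12.1, 18.3, 21.0),
                  x3 = factor(c(0, 3, 5, 3)))

# model.matrix() expands the factor into 0/1 indicator columns,
# one per level except the reference level ("0" here)
mm <- model.matrix(~ x1 + x3, data = toy)

print(colnames(mm))
# "(Intercept)" "x1" "x33" "x35"
```

The column names x33 and x35 are the variable name x3 followed by the level labels "3" and "5". A model fitted this way sees numeric indicator columns rather than a single factor column.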

Answer 2:

Your code works with factors, raster::predict(), and a caret model fitted through the non-formula interface if you pass the factors argument of raster::predict() as a named list instead of a plain character vector:

f <- list(x3 = levels(dFac$x3))

(Replace line f <- levels(dFac$x3) # included, but not necessary.)

Your code

# RF with Caret and FACTORS
rf_ctrl2 <- trainControl(method = "cv", number=10,
                         allowParallel=FALSE, verboseIter=TRUE, 
                         savePredictions=TRUE, classProbs=TRUE)
cFit2 <- train(y = dFac$y, x = dFac[,1:3], method = "rf", 
                tuneLength=4, trControl = rf_ctrl2, importance = TRUE)
model4 <- predict(s2, cFit2,  type="prob", 
                  progress='text', factors=f, index=1:2)

then runs without errors.
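
A rough illustration of why the names matter: with a named list, each set of levels can be matched to the layer of the same name, and that column can be coerced to a factor carrying the training levels before the model sees it. The sketch below is base R only. apply_factor_levels() is a hypothetical helper, not a function from raster or caret, and it only approximates what I assume raster::predict() does with the factors argument.

```r
# Hypothetical helper (not part of raster or caret): coerce every column
# named in `factors` to a factor that carries the training levels.
apply_factor_levels <- function(newdata, factors) {
  stopifnot(is.list(factors), !is.null(names(factors)))
  for (nm in names(factors)) {
    newdata[[nm]] <- factor(newdata[[nm]], levels = factors[[nm]])
  }
  newdata
}

# Levels as they would come from the training factor
train_x3 <- factor(c(0, 1, 3, 5, 7))
f <- list(x3 = levels(train_x3))

# Toy stand-in for cell values extracted from an integer raster
cells <- data.frame(x1 = c(11.2, 19.8, 23.4), x3 = c(3L, 7L, 0L))
cells_fixed <- apply_factor_levels(cells, f)

print(class(cells$x3))        # integer before coercion
print(class(cells_fixed$x3))  # factor after coercion
print(levels(cells_fixed$x3)) # same levels as the training factor
```

A bare character vector such as levels(dFac$x3) carries no variable name, so there is nothing to match it against.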

Source: https://stackoverflow.com/questions/25121725/error-in-predicting-raster-with-randomforest-caret-and-factor-variables

Tags

Spatial

raster

prediction

random-forest