Question
Possible Duplicate:
dropping factor levels in a subsetted data frame in R
I'm trying to use a randomForest to predict sales. I have 3 variables, one of which is a factor variable for storeId. I know that there are levels in the test set that are NOT in the training set. I'm trying to get a prediction for only levels present in the training set but can't get it to look past the new factor levels.
Here's what I've tried so far:
require(randomForest)
train <- data.frame(sales = runif(10)*1000, storeId = factor(seq(1,10,1)), dat1 =runif(10), dat2 = runif(10)*10)
test <- data.frame(storeId = factor(seq(2,11,1)), dat1 =runif(10), dat2 = runif(10)*10)
> train
sales storeId dat1 dat2
1 414.7791 1 0.7830092 7.178577
2 719.5965 2 0.9512138 6.153049
3 887.3197 3 0.6879827 5.413556
4 706.5828 4 0.4486214 4.955400
5 326.8189 5 0.0944885 6.900802
6 840.5920 6 0.1917165 8.044636
7 936.2206 7 0.2173074 4.835064
8 244.6947 8 0.6526765 6.516790
9 818.8747 9 0.3317644 9.651675
10 631.6104 10 0.6998037 8.443972
> test
storeId dat1 dat2
1 2 0.7513645 3.442052
2 3 0.2862487 3.196189
3 4 0.4971865 6.074281
4 5 0.8631945 8.766129
5 6 0.3848105 5.001426
6 7 0.9032262 7.018274
7 8 0.1560501 4.523618
8 9 0.3461597 5.551672
9 10 0.1318464 3.092640
10 11 0.6587270 1.348623
> RF1 <- randomForest(train[,c("storeId","dat1","dat2")], train$sales, do.trace=TRUE,
+ importance=TRUE,ntree=5,forest=TRUE)
| Out-of-bag |
Tree | MSE %Var(y) |
1 | 2.915e+05 544.44 |
2 | 1.825e+05 340.84 |
3 | 2.1e+05 392.19 |
4 | 1.914e+05 357.38 |
5 | 1.809e+05 337.78 |
> pred <- predict(RF1, test)
Error in predict.randomForest(RF1, test) :
New factor levels not present in the training data
This part makes sense.
So I try this:
> test2 <- test[test$storeId != 11,]
> pred <- predict(RF1, test2)
Error in predict.randomForest(RF1, test2) :
New factor levels not present in the training data
So I check the levels:
> levels(test2$storeId)
[1] "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
And the "11" level is still in there.
Next I try this:
> test2$storeId <- as.numeric(as.character(test2$storeId))
> test2$storeId <- factor(test2$storeId)
> pred <- predict(RF1, test2)
Error in predict.randomForest(RF1, test2) :
Type of predictors in new data do not match that of the training data.
despite the fact that things look ok here:
> levels(test2$storeId)
[1] "2" "3" "4" "5" "6" "7" "8" "9" "10"
Any suggestions for getting it to predict on just stores without the "11" level?
EDIT:
> test2$storeId <- as.factor(as.character(test2$storeId))
> pred <- predict(RF1, test2)
Error in predict.randomForest(RF1, test2) :
Type of predictors in new data do not match that of the training data.
>
> test2$storeId <- drop.levels(test2$storeId)
> pred <- predict(RF1, test2)
Error in predict.randomForest(RF1, test2) :
Type of predictors in new data do not match that of the training data.
> str(train)
'data.frame': 10 obs. of 4 variables:
$ sales : num 800 679 589 812 384 ...
$ storeId: Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10
$ dat1 : num 0.5148 0.5567 0.9871 0.0071 0.736 ...
$ dat2 : num 8.501 2.994 2.948 0.519 1.746 ...
> str(test)
'data.frame': 10 obs. of 3 variables:
$ storeId: Factor w/ 10 levels "2","3","4","5",..: 1 2 3 4 5 6 7 8 9 10
$ dat1 : num 0.0975 0.7435 0.7055 0.2085 0.2944 ...
$ dat2 : num 5.96 6.84 3.96 8.93 8.62 ...
> str(test2)
'data.frame': 9 obs. of 3 variables:
$ storeId: Factor w/ 9 levels "2","3","4","5",..: 1 2 3 4 5 6 7 8 9
$ dat1 : num 0.0975 0.7435 0.7055 0.2085 0.2944 ...
$ dat2 : num 5.96 6.84 3.96 8.93 8.62 ...
Answer 1:
You cannot run the randomForest predict function on newdata whose factor levels differ from those the rf model was trained on. The levels of test$storeId range over "2"-"11" and those of train$storeId over "1"-"10", so when you drop level "11" from the test data you are still missing level "1", and randomForest predict fails.
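The mismatch can be checked directly with base R. This is only a sketch: it rebuilds the data frames from the question, and the `set.seed` call is an addition of mine (not in the original post), so the random values will not match the output shown above.

```r
# Rebuild the question's data frames (only storeId matters for this check)
set.seed(1)  # assumption: added for reproducibility, not part of the original post
train <- data.frame(sales = runif(10) * 1000, storeId = factor(seq(1, 10, 1)),
                    dat1 = runif(10), dat2 = runif(10) * 10)
test  <- data.frame(storeId = factor(seq(2, 11, 1)),
                    dat1 = runif(10), dat2 = runif(10) * 10)

# Drop the rows for store 11 and the now-unused "11" level
test2 <- droplevels(test[test$storeId != 11, ])

# Levels in test2 that were never seen in training
unseen  <- setdiff(levels(test2$storeId), levels(train$storeId))
# Levels used in training that are absent from test2
missing <- setdiff(levels(train$storeId), levels(test2$storeId))

unseen                   # character(0)
missing                  # "1"
nlevels(train$storeId)   # 10
nlevels(test2$storeId)   # 9
```

The two level counts differ (10 versus 9), which is consistent with the "Type of predictors in new data do not match" error in the question.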
Answer 2:
This is in fact a duplicate. You should be using droplevels, but even after fixing that problem the levels still don't line up with the training data. You simply have to alter the levels so that they are the same as in the training data:
test1 <- droplevels(subset(test,storeId != 11))
levels(test1$storeId) <- as.character(c(2:10,1))
pred <- predict(RF1, test1)
> pred
1 2 3 4 5 6 7 8 9
698.9186 703.9761 654.5370 561.3058 491.1836 736.4316 639.8752 586.1755 782.1186
The moral here is simply this: your training data had a factor with levels 1,2,...,10, so your test data has to have the exact same set of levels (whether or not you have any data for some of those levels).
Source: https://stackoverflow.com/questions/13055076/r-randomforest-subsetting-cant-get-rid-of-factor-levels
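An alternative to assigning the levels by hand is to rebuild the test factor from the training levels. The following is a sketch of that idea, not the code from the answer above; the name `test_aligned` and the `set.seed` call are my own additions, and the final `predict` call is left commented out because it needs the randomForest package and the `RF1` model from the question.

```r
# Rebuild the question's data frames (only storeId matters for the alignment)
set.seed(1)  # assumption: added for reproducibility, not part of the original post
train <- data.frame(sales = runif(10) * 1000, storeId = factor(seq(1, 10, 1)),
                    dat1 = runif(10), dat2 = runif(10) * 10)
test  <- data.frame(storeId = factor(seq(2, 11, 1)),
                    dat1 = runif(10), dat2 = runif(10) * 10)

# Re-encode storeId using exactly the training levels.
# "11" is not a training level, so those rows become NA.
test_aligned <- test
test_aligned$storeId <- factor(as.character(test$storeId),
                               levels = levels(train$storeId))

# Keep only the rows whose store was seen in training
test_aligned <- test_aligned[!is.na(test_aligned$storeId), ]

identical(levels(test_aligned$storeId), levels(train$storeId))  # TRUE
nrow(test_aligned)                                              # 9

# With the levels aligned, prediction should no longer hit the level checks:
# pred <- predict(RF1, test_aligned)
```

Unlike assigning to `levels()`, which relabels the existing levels by position, `factor(..., levels = )` matches values by label, so the order of the levels in the test data does not matter.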