R - factor examcard has new levels

只谈情不闲聊 提交于 2019-12-13 02:15:33

问题


I built a classification model in R using C5.0 given below:

library(C50)
library(caret)
a = read.csv("All_SRN.csv")
set.seed(123)
inTrain <- createDataPartition(a$anatomy, p = .70, list = FALSE)
training <- a[ inTrain,]
test <- a[-inTrain,]
Tree <- C5.0(anatomy ~ ., data = training, 
            trControl = trainControl(method = "repeatedcv", repeats = 10,
                                     classProb = TRUE))
TreePred <- predict(Tree, test)

The training set has features like - examcard, coil_used, anatomy_region, bodypart_anatomy and anatomy(target class). All the features are categorical variables. There are a total of 10k odd values, I divided the data into training and test data. The learner worked great with this training and test set partioned in 70:30 ratio, but the problem comes when I provide the test set with new values given below:

TreePred <- predict(Tree, test_add)

Here, test_add contains the already present test set and a set of new values and on executing the learner fails to classify the new values and throws the following error:

Error in model.frame.default(object$Terms, newdata, na.action = na.action, : factor examcard has new levels

I tried to merge the new factor levels with the existing one using:

Tree$xlevels[["examcard"]] <- union(Tree$xlevels[["examcard"]], levels(test_add$examcard))

But, this wasn't of much help since the code executed with the following message and didn't yield any fruitful result:

predict code called exit with value 1

The feaure examcard holds a good deal of primacy in the classification hence can't be ignored. How can these set of values be classified?


回答1:


You cannot create a prediction for factor levels in your test set that are absent in your training set. Your model will not have coefficients for these new factor levels.

If you are doing a 70/30 split, you need to repartition your data using caret::CreateDataPartition...

... or your own stratified sample function to ensure that all levels are represented in the training set: use the "split-apply-combine" approach: split the data set by examcard, and for each subset, apply the split, then combine the training subsets and the testing subsets.

See this question for more details.



来源:https://stackoverflow.com/questions/31402819/r-factor-examcard-has-new-levels

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!