Random forests in R (empty classes in y and argument length 0)

Submitted by 人盡茶涼 on 2019-11-30 17:49:30

Based on the discussion in the comments, here's a guess at a potential solution.

The confusion here arises from the fact that the levels of a factor are an attribute of the variable. Those levels will remain the same, no matter what subset you take of the data, no matter how small that subset. This is a feature, not a bug, and a common source of confusion.

If you want to drop missing levels when subsetting, wrap your subset operation in droplevels():

groupA <- droplevels(dataset2[dataset2$order=="groupA",])

I should probably also add that many R users set options(stringsAsFactors = FALSE) when starting a new session (e.g. in their .Rprofile file) to avoid these kinds of hassles. The downside to doing this is that if you share your code with other people frequently, this can cause problems if they haven't altered R's default options.
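The same behaviour can also be requested per call, without touching global options, via the `stringsAsFactors` argument of `data.frame()` (a minimal sketch with invented data; note that since R 4.0.0 the default is already `FALSE`):

```r
# Ask data.frame() not to convert strings to factors, for this call only
df <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
class(df$x)  # "character", not "factor"
```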

Robert Williams

When subsetting removes all observations of some factor levels, those levels still remain attached to the variable, so you must re-encode the factor to drop them:

> levels(train11$str)
[1] "B" "D" "E" "G" "H" "I" "O" "T" "X" "Y" "b"
> train11$str <- factor(train11$str)
> levels(train11$str)
[1] "B" "D" "E" "G" "H" "I" "O" "T" "b"
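A self-contained sketch of that re-encoding (the data here is invented for illustration):

```r
# A factor keeps all of its declared levels even after subsetting
# drops every observation of one of them
train11 <- data.frame(str = factor(c("B", "D", "E", "B", "D")))
sub <- train11[train11$str != "E", , drop = FALSE]

levels(sub$str)              # "B" "D" "E" -- the now-empty level "E" survives
sub$str <- factor(sub$str)   # re-encode, keeping only levels actually present
levels(sub$str)              # "B" "D"
```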

Try using the function formula before passing it to randomForest:

formula("y ~ a+b+c")

This fixed the problem for me.
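For reference, `formula()` simply converts the string into a formula object, which is what the formula interface of randomForest expects:

```r
f <- formula("y ~ a + b + c")
class(f)     # "formula"
all.vars(f)  # "y" "a" "b" "c"
```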

Or it might be that randomForest is matching an argument positionally to the wrong parameter.

Try specifying what each parameter is:

randomForest(data = my_data, mtry = my_mtry, ...)

This happens because you are subsetting your training set before sending the data to your random forest, and subsetting can lose all observations of some levels of your response variable. You therefore need to re-encode the factor with:

dataset2$response <- factor(dataset2$response)

to remove the levels that are no longer present in the data after subsetting.

The problem seems to be in the call itself. If you use the formula interface, then call:

randomForest(response ~ predictorA + predictorB + ... + predictorI, data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)

But it is often more convenient and faster to pass x and y explicitly:

randomForest(y = groupA$response, x = groupA[,c("predictorA", "predictorB", ...)], ntree=100, keep.forest=FALSE, importance=TRUE)

Instead of variable names, you can also use column indices. Try these suggestions.

Just another suggestion to add to the mix: there is a chance that you don't want read.csv() to interpret strings as factors. Try adding colClasses to read.csv() so that every column is read as character:

dataset <- read.csv("datasetNA.csv",
                    sep = ";",
                    header = TRUE,
                    colClasses = "character")

Given a call like:

randomForest(x = data, y = label, importance = TRUE, ntree = 1000)

label is a factor, so use droplevels(label) to remove the levels with zero count before passing it to randomForest. It works.

To check the count for each level, use table(label).
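A quick illustration with invented data:

```r
# "c" is a declared level with zero observations
label <- factor(c("a", "a", "b"), levels = c("a", "b", "c"))
table(label)               # counts: a = 2, b = 1, c = 0
label <- droplevels(label) # drop the zero-count level
table(label)               # counts: a = 2, b = 1
```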

I ran into the same problem today and solved it. randomForest chooses its mode from the type of the response: a factor response triggers classification, while my response was numerical (regression). Also, when you use a subset as the training dataset, the factor levels present in training can be restricted compared with the test set.
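Since the mode is chosen from the class of the response, you can control it explicitly (a minimal sketch with invented data; the variable names are hypothetical):

```r
response <- c("1", "2", "2", "1")
y_class <- factor(response)      # factor  -> randomForest runs classification
y_reg   <- as.numeric(response)  # numeric -> randomForest runs regression
class(y_class)  # "factor"
class(y_reg)    # "numeric"
```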
