I'm working with random forests for the first time and I'm running into some problems I can't figure out. When I run the analysis on my whole dataset (about 3000 rows) I don't get any error message, but when I perform the same analysis on a subset of my dataset (about 300 rows) I get an error:
dataset <- read.csv("datasetNA.csv", sep=";", header=T)
names (dataset)
dataset2 <- dataset[complete.cases(dataset$response),]
library(randomForest)
dataset2 <- na.roughfix(dataset2)
data.rforest <- randomForest(dataset2$response ~ dataset2$predictorA + dataset2$predictorB +
                               dataset2$predictorC + dataset2$predictorD + dataset2$predictorE +
                               dataset2$predictorF + dataset2$predictorG + dataset2$predictorH +
                               dataset2$predictorI,
                             data=dataset2, ntree=100, keep.forest=FALSE, importance=TRUE)
# subset of my original dataset:
groupA<-dataset2[dataset2$order=="groupA",]
data.rforest <- randomForest(groupA$response ~ groupA$predictorA + groupA$predictorB +
                               groupA$predictorC + groupA$predictorD + groupA$predictorE +
                               groupA$predictorF + groupA$predictorG + groupA$predictorH +
                               groupA$predictorI,
                             data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)
Error in randomForest.default(m, y, ...) : Can't have empty classes in y.
However, my response variable doesn't have any empty classes.
If instead I call randomForest as (a+b+c, y) rather than (y ~ a+b+c), I get this other message:
Error in if (n == 0) stop("data (x) has 0 rows") :
argument length zero
Warning messages:
1: In Ops.factor(groupA$responseA + groupA$responseB, :
+ not meaningful for factors
The second problem is that when I try to impute my data through rfImpute() I get an error:
Error in na.roughfix.default(x) : roughfix can only deal with numeric data
However, my columns are all either factors or numeric.
Can somebody see where I'm going wrong?
Based on the discussion in the comments, here's a guess at a potential solution.
The confusion here arises from the fact that the levels of a factor are an attribute of the variable. Those levels will remain the same, no matter what subset you take of the data, no matter how small that subset. This is a feature, not a bug, and a common source of confusion.
If you want to drop unused levels when subsetting, wrap your subset operation in droplevels():
groupA <- droplevels(dataset2[dataset2$order=="groupA",])
I should probably also add that many R users set options(stringsAsFactors = FALSE) when starting a new session (e.g. in their .Rprofile file) to avoid these kinds of hassles. The downside to doing this is that if you share your code with other people frequently, this can cause problems if they haven't altered R's default options.
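To see the behavior concretely, here is a minimal sketch with a made-up factor (not the asker's data): subsetting keeps all the original levels, and droplevels() discards the now-empty ones.

```r
# A factor remembers all of its levels, even after subsetting
x <- factor(c("a", "a", "b", "c"))
sub <- x[x != "c"]

levels(sub)              # still "a" "b" "c" -- "c" is now an empty class
levels(droplevels(sub))  # "a" "b" -- unused level dropped
```

It is exactly that lingering empty level that randomForest complains about with "Can't have empty classes in y."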
When observations are removed by subsetting, unused factor levels remain; re-create the factor to reset them:
levels(train11$str)
[1] "B" "D" "E" "G" "H" "I" "O" "T" "X" "Y" "b"
train11$str <- factor(train11$str)
levels(train11$str)
[1] "B" "D" "E" "G" "H" "I" "O" "T" "b"
Try using the function formula before passing it to randomForest:
formula("y ~ a+b+c")
This fixed the problem for me.
Or it might be that randomForest is mistaking one parameter for another.
Try naming each parameter explicitly, for example:
randomForest(x = my_x, y = my_y, data = my_data, mtry = my_mtry, ...)
This happens because you are subsetting your training set before sending the data to your random forest, and subsetting can remove every observation of some levels of your response variable. You therefore need to re-create the factor:
dataset2$response <- factor(dataset2$response)
This removes the levels that are no longer present in the data after subsetting.
It seems the problem is in the call statement. If you use the formula interface, then call:
randomForest(response ~ predictorA + predictorB + ... + predictorI, data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)
But it is more convenient and faster to pass x and y explicitly:
randomForest(y = groupA$response, x = groupA[,c("predictorA", "predictorB", ...)], ntree=100, keep.forest=FALSE, importance=TRUE)
Instead of variable names you can also use their indices. Try these suggestions.
Just another suggestion to add to the mix: there is a chance that you don't want read.csv() to interpret strings as factors. Try adding colClasses to read.csv() to force conversion to characters:
dataset <- read.csv("datasetNA.csv",
                    sep = ";",
                    header = TRUE,
                    colClasses = "character")
randomForest(x = data, y = label, importance = TRUE, ntree = 1000)
label is a factor, so use droplevels(label) to remove the levels with zero counts before passing it to the randomForest function. That works.
To check the count for each level, use table(label).
I had the same problem today and solved it. When you run a random forest, R defaults to classification when the response is a factor, while my response was numerical. And when you use a subset as the training dataset, the levels of the training set are restricted compared with the test set.
Source: https://stackoverflow.com/questions/13495041/random-forests-in-r-empty-classes-in-y-and-argument-legth-0