Random forests in R (empty classes in y and argument length 0)

Submitted by 荒凉一梦 on 2019-11-30 01:24:38

Question


I'm working with random forests for the first time and I'm running into trouble I can't figure out. When I run the analysis on my full dataset (about 3000 rows) I don't get any error message, but when I perform the same analysis on a subset of my dataset (about 300 rows) I get an error:

dataset <- read.csv("datasetNA.csv", sep=";", header=T)
names (dataset)
dataset2 <- dataset[complete.cases(dataset$response),]
library(randomForest)
dataset2 <- na.roughfix(dataset2)
data.rforest <- randomForest(dataset2$response ~ dataset2$predictorA + dataset2$predictorB +
                             dataset2$predictorC + dataset2$predictorD + dataset2$predictorE +
                             dataset2$predictorF + dataset2$predictorG + dataset2$predictorH +
                             dataset2$predictorI,
                             data=dataset2, ntree=100, keep.forest=FALSE, importance=TRUE)

# subset of my original dataset:
groupA<-dataset2[dataset2$order=="groupA",]
data.rforest <- randomForest(groupA$response ~ groupA$predictorA + groupA$predictorB +
                             groupA$predictorC + groupA$predictorD + groupA$predictorE +
                             groupA$predictorF + groupA$predictorG + groupA$predictorH +
                             groupA$predictorI,
                             data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)

Error in randomForest.default(m, y, ...) : Can't have empty classes in y.

However, my response variable doesn't have any empty classes.

If instead I call randomForest as (a+b+c, y) rather than (y ~ a+b+c), I get this other message:

Error in if (n == 0) stop("data (x) has 0 rows") : 
  argument length zero
Warning messages:
1: In Ops.factor(groupA$responseA + groupA$responseB,  :
  + not meaningful for factors

The second problem is that when I try to impute my data with rfImpute() I get an error:

Errore in na.roughfix.default(x) :  roughfix can only deal with numeric data

However, my columns are all either factors or numeric.

Can somebody see where I'm going wrong?


Answer 1:


Based on the discussion in the comments, here's a guess at a potential solution.

The confusion here arises from the fact that the levels of a factor are an attribute of the variable. Those levels will remain the same, no matter what subset you take of the data, no matter how small that subset. This is a feature, not a bug, and a common source of confusion.

If you want to drop unused levels when subsetting, wrap your subset operation in droplevels():

groupA <- droplevels(dataset2[dataset2$order=="groupA",])
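To see why this matters, here's a minimal sketch (using a made-up factor, not the asker's data) showing that subsetting keeps all of the original levels until droplevels() is applied:

```r
# A factor with three levels
grp <- factor(c("groupA", "groupA", "groupB", "groupC"))

# Subsetting keeps only the matching rows, but NOT-used levels remain
# attached to the factor:
sub <- grp[grp == "groupA"]
levels(sub)    # still "groupA" "groupB" "groupC"

# droplevels() removes the now-empty levels:
sub2 <- droplevels(sub)
levels(sub2)   # just "groupA"
```

It's those empty-but-still-present levels that make randomForest complain about empty classes in y.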

I should probably also add that many R users set options(stringsAsFactors = FALSE) when starting a new session (e.g. in their .Rprofile file) to avoid these kinds of hassles. The downside to doing this is that if you share your code with other people frequently, this can cause problems if they haven't altered R's default options.




Answer 2:


When factor levels are removed by subsetting, you must reset the levels:

levels(train11$str)
[1] "B" "D" "E" "G" "H" "I" "O" "T" "X" "Y" "b"
train11$str <- factor(train11$str)
levels(train11$str)
[1] "B" "D" "E" "G" "H" "I" "O" "T" "b"



Answer 3:


Try using the formula() function before passing the formula to randomForest:

formula("y ~ a+b+c")

This fixed the problem for me.

Or it might be that randomForest is mistaking one parameter for another.

Try specifying each parameter by name:

randomForest(formula=my_formula, data=my_data, mtry=my_mtry, ...)



Answer 4:


This is because you are subsetting your training set before passing the data to randomForest, and subsetting can leave some levels of your response variable with no observations. You therefore need to rebuild the factor:

dataset2$response <- factor(dataset2$response)

This removes the levels that are no longer present in the data after subsetting.
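As a minimal illustration (with hypothetical values, not the asker's dataset), calling factor() on a subsetted factor rebuilds it from only the values actually present:

```r
response <- factor(c("yes", "no", "yes", "maybe"))

# Drop the "maybe" rows; the level itself survives the subset:
sub <- response[response != "maybe"]
levels(sub)    # "maybe" "no" "yes" -- empty level still attached

# Re-running factor() rebuilds the levels from the remaining values:
sub <- factor(sub)
levels(sub)    # "no" "yes"
```

This is equivalent in effect to droplevels() from the earlier answers; both drop the empty levels that trigger the error.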




Answer 5:


It seems the problem is in the call itself. If you use the formula interface, then call

randomForest(response ~ predictorA + predictorB + ... + predictorI, data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)

But it is more convenient and faster to pass x and y explicitly:

randomForest(y = groupA$response, x = groupA[,c("predictorA", "predictorB", ...)], ntree=100, keep.forest=FALSE, importance=TRUE)

Instead of variable names you can also use their column indices. Try these suggestions.




Answer 6:


Just another suggestion to add to the mix: there's a chance you don't want read.csv() to interpret strings as factors at all. Try adding colClasses to read.csv to force conversion to character:

dataset <- read.csv("datasetNA.csv", 
                    sep=";", 
                    header=T,
                    colClasses="character")
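A small self-contained sketch (using an inline CSV string rather than the asker's file) shows the effect of colClasses:

```r
# Inline stand-in for a semicolon-separated CSV file
csv_text <- "response;predictorA\nyes;1\nno;2"

d <- read.csv(text = csv_text, sep = ";", header = TRUE,
              colClasses = "character")

sapply(d, class)   # every column is "character", not "factor"
```

With no factors in the data frame, the empty-level problem can't arise, though you then have to convert the response yourself before fitting a classification forest.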



Answer 7:


randomForest(x = data, y = label, importance = TRUE, ntree = 1000)

label is a factor, so use droplevels(label) to remove the levels with zero count before passing it to the randomForest function. That resolves the error.

To check the count for each level, use table(label).
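For instance (hypothetical labels, not the asker's data), table() makes the empty level visible before and after droplevels():

```r
label <- factor(c("A", "A", "B", "C"))

# Subsetting away all the "C" rows leaves level C with a zero count:
sub <- label[label != "C"]
table(sub)    # A=2, B=1, C=0 -- the zero-count level triggers the error

# droplevels() removes it:
sub <- droplevels(sub)
table(sub)    # A=2, B=1
```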




Answer 8:


I ran into the same problem today and solved it. randomForest performs classification when the response is a factor, which is what happened to me even though my response was meant to be numeric. And when you use a subset as the training dataset, the factor levels of the training data are restricted compared with the full dataset.



Source: https://stackoverflow.com/questions/13495041/random-forests-in-r-empty-classes-in-y-and-argument-legth-0
