cforest party unbalanced classes

问题

I want to measure the features importance with the cforest function from the party library.

My output variable has something like 2000 samples in class 0 and 100 samples in class 1.

I think a good way to avoid bias due to class unbalance is to train each tree of the forest using a subsample such that the number of elements of class 1 is the same of the number of element in class 0.

Is there anyway to do that? I am thinking to an option like n_samples = c(20, 20)

EDIT: An example of code

   > iris.cf <- cforest(Species ~ ., data = iris, 
    +                    control = cforest_unbiased(mtry = 2)) #<--- Here I would like to train the forest using a balanced subsample of the data

 > varimp(object = iris.cf)
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
     0.048981818  0.002254545  0.305818182  0.271163636 
    >

EDIT: Maybe my question is not clear enough. Random forest is a set of decision trees. In general the decision trees are constructed using only a random subsample of the data. I would like that the used subsample has the same numbers of element in the class 1 and in the class 0.

EDIT: The function that I am looking for is for sure available in the randomForest package

sampsize    
Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.

I need the same for the party package. Is there any way to get it?

回答1:

I will assume you know what you want to accomplish, but don't know enough R to do that.

Not sure if the function provides balancing of data as an argument, but you can do it manually. Below is the code I quickly threw together. More elegant solution might exist.

# just in case
myData <- iris
# replicate everything *10* times. Replicate is just a "loop 10 times".
replicate(10,
    {   
        # split dataset by class and add separate classes to list
        splitList <- split(myData, myData$Species)
        # sample *20* random rows from each matrix in a list
        sampledList <- lapply(splitList, function(dat) { dat[sample(20),] })
        # combine sampled rows to a data.frame
        sampledData <- do.call(rbind, sampledList)

        # your code below
        res.cf <- cforest(Species ~ ., data = sampledData,
                          control = cforest_unbiased(mtry = 2)
                          )
        varimp(object = res.cf)
    }
)

Hope you can take it from here.

来源：https://stackoverflow.com/questions/26393675/cforest-party-unbalanced-classes

标签

random-forest

party