Random Forest with classes that are very unbalanced

攒了一身酷 2020-12-05 05:40

I am using random forests on a big-data problem with a very unbalanced response class, so I read the documentation and found the following parameters:

4 Answers
  • 2020-12-05 06:02

    You should try using sampling methods that reduce the degree of imbalance from 1:10,000 down to 1:100 or 1:10. You should also reduce the size of the trees that are generated. (At the moment these are recommendations that I am repeating only from memory, but I will see if I can track down more authority than my spongy cortex.)

    One way of reducing the size of trees is to set the "nodesize" larger. With that degree of imbalance you might need to have the node size really large, say 5-10,000. Here's a thread in rhelp: https://stat.ethz.ch/pipermail/r-help/2011-September/289288.html

    In the current state of the question you have sampsize=c(250000, 2000), whereas I would have thought something like sampsize=c(8000, 2000) was more in line with my suggestions. I suspect you are creating bootstrap samples that contain none of the class that was sampled with only 2000.
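    A minimal sketch of those suggestions (assuming a data frame `dat` with a two-level factor response `y` whose rarer level has about 2000 cases; the exact numbers are illustrative, not prescriptive):

    ```r
    library(randomForest)

    # Hypothetical data frame `dat` with factor response `y`.
    # sampsize draws a stratified per-tree bootstrap sample of 8000 majority
    # and 2000 minority cases instead of 250000/2000; a large nodesize keeps
    # the individual trees small, as suggested above.
    rf_fit <- randomForest(y ~ ., data = dat,
                           ntree = 500,
                           sampsize = c(8000, 2000),
                           nodesize = 5000)
    ```
    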

  • 2020-12-05 06:03

    Pretty sure I disagree with the idea of removing observations from your sample.

    Instead, you might consider using stratified sampling to fix the percentage of each class drawn in each resample. This can be done with the caret package. That way you do not omit observations by shrinking your training sample; it will not let you over-represent a class, but it ensures every subsample contains a representative share of each one.

    Here is an example I found:

    library(caret)         # trainControl(), train()
    library(randomForest)  # backend for method = "rf"

    len_pos <- nrow(example_dataset[example_dataset$target == 1, ])
    len_neg <- nrow(example_dataset[example_dataset$target == 0, ])

    train_model <- function(training_data, labels, model_type, ...) {
      experiment_control <- trainControl(method = "repeatedcv",
                                         number = 10,
                                         repeats = 2,
                                         classProbs = TRUE,
                                         summaryFunction = custom_summary_function)
      train(x = training_data,
            y = labels,
            method = model_type,
            metric = "custom_score",
            trControl = experiment_control,
            verbose = FALSE,
            ...)
    }

    # strata:   the factor to stratify the per-tree bootstrap sampling on.
    # sampsize: how many observations to draw from each class for each tree's
    #           bootstrap sample (25% of positives, 80% of negatives here).
    fit_results <- train_model(example_dataset,
                               as.factor(sprintf("c%d", as.numeric(example_dataset$target))),
                               "rf",
                               tuneGrid = expand.grid(mtry = c(3, 5, 10)),
                               ntree = 500,
                               strata = as.factor(example_dataset$target),
                               sampsize = c('1' = as.integer(len_pos * 0.25),
                                            '0' = as.integer(len_neg * 0.8)))
    
  • 2020-12-05 06:05

    Sorry, I don't know how to post a comment on the earlier answer, so I'll create a separate answer.

    I suppose the problem is caused by the high imbalance of the dataset (too few cases of one of the classes). For each tree, the RF algorithm draws a bootstrap sample, which becomes that tree's training set. If you have too few examples of one class, the bootstrap sample may contain examples of only one class (the majority class), and a tree cannot be grown from single-class examples. There appears to be a limit of 10 unsuccessful sampling attempts. So DWin's proposal to reduce the degree of imbalance to lower values (1:100 or 1:10) is the most reasonable one.
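    The failure mode described above is easy to quantify: if the minority class makes up a fraction p of the data and a sample of size n is drawn, the chance the sample contains no minority case at all is roughly (1 - p)^n. A quick base-R check:

    ```r
    # Approximate probability that a sample of size n, drawn from data whose
    # minority-class fraction is p, contains zero minority examples.
    p_no_minority <- function(p, n) (1 - p)^n

    # At 1:10,000 imbalance, even a sample of 1000 usually misses the class:
    p_no_minority(1 / 10001, 1000)   # ~0.90
    # At 1:100, as suggested, the problem essentially disappears:
    p_no_minority(1 / 101, 1000)     # ~5e-05
    ```
    
    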

  • 2020-12-05 06:08

    There are a few options.

    If you have a lot of data, set aside a random sample of the data. Build your model on one set, then use the other to determine a proper cutoff for the class probabilities using an ROC curve.
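    One way to pick that cutoff is with the pROC package; a sketch, assuming `labels` holds the true classes and `probs` the predicted class probabilities from the held-out set:

    ```r
    library(pROC)

    # Hypothetical held-out data: `labels` are the true classes (0/1) and
    # `probs` the class-1 probabilities from the model built on the first set.
    roc_obj <- roc(response = labels, predictor = probs)

    # "best" selects the threshold maximising Youden's J (sensitivity +
    # specificity - 1); use it as the class-probability cutoff.
    best_cut <- coords(roc_obj, x = "best", ret = "threshold")
    ```
    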

    You can also upsample the data in the minority class. The SMOTE algorithm might help (see the reference below and the DMwR package for a function).
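    A sketch of how that function might be called (the data frame `dat` and the column name `target` are placeholders; note that DMwR has since been archived on CRAN):

    ```r
    library(DMwR)  # provides SMOTE()

    # perc.over = 200 creates two synthetic minority cases per original one;
    # perc.under = 150 controls how many majority cases are kept alongside them.
    balanced <- SMOTE(target ~ ., data = dat,
                      perc.over = 200, perc.under = 150)
    ```
    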

    You can also use other techniques. rpart() and a few other functions can allow different costs on the errors, so you could favor the minority class more. You can bag this type of rpart() model to approximate what random forest is doing.
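    With rpart(), the unequal costs go in via a loss matrix; a sketch, again with a hypothetical data frame `dat` and two-level factor `target` whose second level is the minority class:

    ```r
    library(rpart)

    # Loss matrix: rows are true classes, columns predicted classes, diagonal
    # zero. Filled column-wise, so misclassifying the second (minority) class
    # costs 10 while misclassifying the majority class costs 1.
    fit <- rpart(target ~ ., data = dat,
                 parms = list(loss = matrix(c(0, 10,
                                              1, 0), nrow = 2)))
    ```
    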

    ksvm() in the kernlab package can also use unbalanced costs (but the probability estimates are no longer good when you do this). Many other packages have arguments for setting the priors. You can also adjust this to put more emphasis on the minority class.
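    The unbalanced costs in ksvm() are set with its class.weights argument; a sketch with placeholder names:

    ```r
    library(kernlab)

    # Hypothetical two-class data frame `dat`: weight errors on the minority
    # class ("1") ten times as heavily as on the majority class ("0").
    svm_fit <- ksvm(target ~ ., data = dat,
                    class.weights = c("0" = 1, "1" = 10))
    ```
    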

    One last thought: maximizing models based on accuracy isn't going to get you anywhere (you can get 99.99% off the bat just by always predicting the majority class). The caret package can tune models based on the Kappa statistic, which is a much better choice in your case.
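    Selecting tuning parameters by Kappa instead of accuracy is a one-line change in caret's train(); a sketch, assuming predictors `x` and factor labels `y`:

    ```r
    library(caret)

    # metric = "Kappa" makes caret pick the tuning parameters that maximise
    # Cohen's Kappa across the cross-validation resamples, rather than raw
    # accuracy, which is nearly meaningless at this degree of imbalance.
    fit <- train(x = x, y = y,
                 method = "rf",
                 metric = "Kappa",
                 trControl = trainControl(method = "cv", number = 5))
    ```
    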
