Handling imbalanced data in multi-class classification problem

Submitted by 删除回忆录丶 on 2021-02-16 15:24:08

Question


I have a multi-class classification problem and the data is heavily skewed. My target variable (y) has 3 classes, with the following shares of the data: 0 = 3%, 1 = 90%, 2 = 7%.

I am looking for packages in R that can do multi-class oversampling, undersampling, or both.

If this is not doable in R, where else can I handle this problem?

PS: I tried the ROSE package in R, but it only works for binary classification problems.


Answer 1:


There is the caret package, which offers a wide range of ML algorithms, including ones for multi-class problems.

It can also apply down- and upsampling via downSample() and upSample():

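library(caret)

# downSample() expects a factor outcome, so the label is wrapped in factor()
trainclass <- data.frame("label" = factor(c(rep("class1", 100), rep("class2", 20), rep("class3", 180))),
                         "predictor1" = rnorm(300, 0, 1),
                         "predictor2" = sample(c("this", "that"), 300, replace = TRUE))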

> table(trainclass$label)
class1 class2 class3 
   100     20    180 

#then use
set.seed(234)
dtrain <- downSample(x = trainclass[, -1],
                     y = trainclass$label)

> table(dtrain$Class)
class1 class2 class3 
    20     20     20 
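
The answer names upSample() but only demonstrates downSample(). As a minimal sketch of the opposite direction, assuming the same 100/20/180 class split, the call could look like the following. The names `y_labels`, `predictors`, and `utrain` are made up for this illustration.

```r
library(caret)

set.seed(234)
# Same class split as the example above; the outcome must be a factor
y_labels <- factor(c(rep("class1", 100), rep("class2", 20), rep("class3", 180)))
predictors <- data.frame(predictor1 = rnorm(300),
                         predictor2 = sample(c("this", "that"), 300, replace = TRUE))

# upSample() samples the smaller classes with replacement until every class
# is as large as the biggest one; the outcome column is named "Class" by default
utrain <- upSample(x = predictors, y = y_labels)
table(utrain$Class)
```

Every class should end up with 180 rows here. Since the extra rows are duplicates, it is usually safer to upsample only the training portion of the data.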

Nice feature: caret can also apply downsampling, upsampling, SMOTE, and ROSE inside resampling procedures (such as cross-validation).

This performs 10-fold cross-validation using downsampling.

ctrl <- caret::trainControl(method = "cv",
                            number = 10,
                            verboseIter = FALSE,
                            summaryFunction = multiClassSummary,  # comma was missing here
                            sampling = "down")

set.seed(42)
# `data` must be a data frame whose factor outcome column is named Class
model_rf_under <- caret::train(Class ~ .,
                               data = data,
                               method = "rf",
                               trControl = ctrl)
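
For comparison, here is a self-contained sketch of the same idea with upsampling inside the resampling loop. The toy data, the names `toy`, `ctrl_up`, and `model_up`, and the choice of `method = "rpart"` with 5 folds are assumptions made to keep the example small; they are not part of the original answer.

```r
library(caret)

set.seed(42)
# Hypothetical toy data: factor outcome `Class` with a 100/20/180 split
n_per_class <- c(class1 = 100, class2 = 20, class3 = 180)
Class <- factor(rep(names(n_per_class), times = n_per_class))
toy <- data.frame(Class = Class,
                  predictor1 = rnorm(300, mean = as.integer(Class)),  # shifts with the class
                  predictor2 = rnorm(300))                            # pure noise

# sampling = "up" upsamples the training part of each fold only;
# the held-out part keeps its original class distribution
ctrl_up <- trainControl(method = "cv", number = 5, sampling = "up")

model_up <- train(Class ~ ., data = toy, method = "rpart", trControl = ctrl_up)
print(model_up)
```

Doing the sampling inside `train()` rather than once beforehand means the performance estimates are computed on held-out folds that were not rebalanced.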

See further information here: https://topepo.github.io/caret/subsampling-for-class-imbalances.html

Also check out the mlr package: https://mlr.mlr-org.com/articles/tutorial/over_and_undersampling.html#sampling-based-approaches




Answer 2:


You can use the SMOTE function from the DMwR package. I have created a sample dataset with three imbalanced classes.

install.packages("DMwR")
library(DMwR)

## A small example with a data set created artificially from the IRIS
## data 
data(iris)

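# setosa 90%, versicolor 3% and virginica 7%
# factor() is needed because SMOTE relies on the factor levels of the target
# (since R 4.0, strings are no longer converted to factors automatically)
Species <- factor(c(rep("setosa", 135), rep("versicolor", 5), rep("virginica", 10)))
data <- cbind(iris[, 1:4], Species)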
table(data$Species)

Imbalanced classes:

setosa versicolor  virginica 
  135       5         10 

Now, to rebalance the two minority classes, apply the SMOTE function twice to the data:

First_Imbalence_recover <- DMwR::SMOTE(Species ~ ., data, perc.over = 2000,perc.under=100)

Final_Imbalence_recover <- DMwR::SMOTE(Species ~ ., First_Imbalence_recover, perc.over = 2000,perc.under=200)
table(Final_Imbalence_recover$Species)

Final balanced classes:

setosa versicolor  virginica 
    79         81         84

NOTE: The synthetic examples are generated using information from the k nearest neighbors of each minority-class example; the parameter k controls how many of these neighbors are used. Because the procedure is random, the class counts may vary from run to run, which shouldn't affect the overall balance.
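
To make the nearest-neighbor idea concrete, here is a rough base R sketch of SMOTE-style interpolation. It is not the DMwR implementation: the function `smote_sketch` and its arguments `n_new` and `k` are hypothetical names for this illustration, and it only handles numeric predictors of a single minority class.

```r
# Minimal sketch of SMOTE-style interpolation (NOT the DMwR implementation).
# X: numeric predictors of the minority class, n_new: number of synthetic rows,
# k: number of nearest neighbours to choose from.
smote_sketch <- function(X, n_new, k = 5) {
  X <- as.matrix(X)
  stopifnot(nrow(X) >= 2)
  k <- min(k, nrow(X) - 1)                 # cannot use more neighbours than other rows
  d <- as.matrix(dist(X))                  # pairwise Euclidean distances
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(X),
                      dimnames = list(NULL, colnames(X)))
  for (i in seq_len(n_new)) {
    base <- sample(nrow(X), 1)             # pick a minority example at random
    others <- setdiff(seq_len(nrow(X)), base)
    nn <- others[order(d[base, others])][seq_len(k)]  # its k nearest neighbours
    nb <- nn[sample(length(nn), 1)]        # pick one of those neighbours
    gap <- runif(1)                        # random position between the two points
    synthetic[i, ] <- X[base, ] + gap * (X[nb, ] - X[base, ])
  }
  synthetic
}

set.seed(123)
# Five versicolor rows stand in for a small minority class
minority <- iris[iris$Species == "versicolor", 1:4][1:5, ]
new_rows <- smote_sketch(minority, n_new = 10, k = 3)
dim(new_rows)
```

Each synthetic row lies on the line segment between a real minority example and one of its neighbors, which is why different seeds give different rows and slightly different results.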



Source: https://stackoverflow.com/questions/54779380/handling-imbalanced-data-in-multi-class-classification-problem
