Neural Network - Working with an imbalanced dataset


There are two common ways to handle an imbalanced dataset:

  1. Online sampling, as mentioned above: in each iteration you sample a class-balanced batch from the training set (a minimal sketch follows this list).

  2. Re-weight the cost of the two classes respectively, giving the loss on the dominant class a smaller weight. This is used, for example, in the paper Holistically-Nested Edge Detection.
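A minimal sketch of option 1, assuming the training data lives in NumPy arrays X and y; the function name and the sampling-with-replacement choice are illustrative rather than taken from the answer:

    import numpy as np

    # Draw a class-balanced mini-batch: the same number of examples from every
    # class, sampled with replacement so rare classes can fill their share.
    def sample_balanced_batch(X, y, batch_size, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        classes = np.unique(y)
        per_class = batch_size // len(classes)
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=per_class, replace=True)
            for c in classes
        ])
        rng.shuffle(idx)
        return X[idx], y[idx]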

I will expand a bit on chasep's answer. If you are using a neural network followed by softmax + cross-entropy or hinge loss, you can, as @chasep255 mentioned, make it more costly for the network to misclassify the examples that appear less often.
To do that, simply split the cost into two parts and put more weight on the class that has fewer examples.
For simplicity, say that the dominant class is labelled negative (neg) and the other positive (pos) for softmax (for hinge loss you could do exactly the same):

 L = L_{neg} + L_{pos} \;\Rightarrow\; L = L_{neg} + \alpha \cdot L_{pos} 

With \alpha greater than 1.

In TensorFlow, for the case of cross-entropy where the positives are labelled [1, 0] and the negatives [0, 1], this would translate to something like:

    cross_entropy_mean = -tf.reduce_mean(targets * tf.log(y_out) * tf.constant([alpha, 1.]))
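For completeness, here is a self-contained TF2-style version of that line with made-up values; the tensors and alpha are only illustrative:

    import tensorflow as tf

    alpha = 5.0                                                 # weight on the rare positive class
    targets = tf.constant([[1., 0.], [0., 1.], [0., 1.]])      # one-hot labels: pos = [1, 0], neg = [0, 1]
    y_out = tf.constant([[0.3, 0.7], [0.2, 0.8], [0.6, 0.4]])  # softmax outputs of the network

    # Same weighted cross-entropy as above, written with the TF2 name tf.math.log.
    cross_entropy_mean = -tf.reduce_mean(targets * tf.math.log(y_out) * tf.constant([alpha, 1.]))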

What is more, by digging a bit into the TensorFlow API you will find a function, tf.nn.weighted_cross_entropy_with_logits, that implements this; I did not read the details, but it looks fairly straightforward.
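A hedged sketch of how that function could be used: it expects raw logits rather than softmax outputs, and its pos_weight argument plays the role of \alpha above (the values below are made up):

    import tensorflow as tf

    labels = tf.constant([[1.], [0.], [0.]])      # 1 = rare positive class
    logits = tf.constant([[0.4], [-1.2], [2.0]])  # pre-sigmoid network outputs

    # pos_weight > 1 makes errors on the positive class more costly.
    loss = tf.nn.weighted_cross_entropy_with_logits(
        labels=labels, logits=logits, pos_weight=5.0)
    mean_loss = tf.reduce_mean(loss)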

Another way, if you train your algorithm with mini-batch SGD, would be to make batches with a fixed proportion of positives. I would go with the first option as it is slightly easier to do with TF.

One thing I might try is weighting the samples differently when calculating the cost. For instance, maybe divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a 1. This way the rarer samples have more of an impact. You could also simply try training it without any changes and see if the nnet just happens to work. I would make sure to use a large batch size, though, so you always get at least one of the rare samples in each batch.
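A minimal sketch of that per-sample weighting idea, assuming a binary sigmoid output; the labels, logits and the 1/250 factor are illustrative placeholders:

    import tensorflow as tf

    labels = tf.constant([0., 1., 0., 0.])
    logits = tf.constant([-1.0, 0.5, -2.0, 0.1])

    # Per-example cross-entropy, then scale the loss of the dominant class (0) down by 250.
    per_example = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    weights = tf.where(labels > 0.5, 1.0, 1.0 / 250.0)
    loss = tf.reduce_mean(per_example * weights)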

Yes, a neural network could help in your case. There are a few approaches to such a problem:

  1. Leave your set unchanged but decrease the batch size and the number of epochs. Apparently this might work better than keeping the batch size big. From my experience, in the beginning the network adjusts its weights to assign the most probable class to every example, but after many epochs it will start adjusting itself to increase performance on the whole dataset. Using cross-entropy will give you additional information about the probability of assigning 1 to a given example (assuming your network has sufficient capacity).
  2. Balance your dataset and adjust your scores during the evaluation phase using Bayes' rule: score_of_class_k ~ score_from_model_for_class_k * original_percentage_of_class_k (see the sketch after this list).
  3. You may reweight your classes in the cost function (as mentioned in one of the answers). The important thing then is to also reweight your scores in your final answer.
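A short sketch of the Bayes-rule adjustment in approach 2, assuming the model was trained on a balanced set so its softmax approximates the posterior under a uniform class prior; the numbers are made up:

    import numpy as np

    original_percentage = np.array([0.004, 0.996])  # true class frequencies (e.g. 1 positive per 250 examples)
    score_from_model = np.array([0.70, 0.30])       # softmax output of the model trained on balanced data

    # Re-weight by the true prior and renormalise to get scores under the original class balance.
    adjusted = score_from_model * original_percentage
    adjusted /= adjusted.sum()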