I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dea
@dbakr Did you get an answer about the biased predictions on your imbalanced dataset?
Though I'm not sure it was your original plan, note that if you first subsample the majority class of your dataset by a ratio r, then to get unbiased predictions from Spark's logistic regression you can either:
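- use the rawPrediction provided by the transform() function and adjust the intercept by log(r) (that is, add log(r) to the raw log-odds, taking r as the fraction of majority-class rows you kept and the majority class as the negative label)
- or train your regression with per-row weights using .setWeightCol("classWeightCol") (see the article cited here to figure out the value that must be set in the weights).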
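
To make the two options concrete, here is a minimal sketch in plain Python (no Spark needed), so treat it as an illustration rather than the exact Spark code. It assumes r is the fraction of majority-class rows kept (0 < r <= 1), that the majority class is label 0, and that `raw_margin` is the log-odds of the positive class. As far as I know, Spark's binary logistic regression puts this in the second element of the rawPrediction vector. The function names are hypothetical, and the 1/r weight is the usual inverse-sampling-rate choice, not necessarily the value the cited article recommends.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def corrected_probability(raw_margin, r):
    """Option 1: undo the effect of subsampling on the intercept.

    raw_margin: log-odds of the positive class from a model trained on
                the subsampled data (assumption: positive = minority).
    r:          fraction of majority-class rows kept, 0 < r <= 1.

    Keeping only a fraction r of the negatives inflates the odds of the
    positive class by 1/r, so adding log(r) restores the original odds.
    """
    if not 0.0 < r <= 1.0:
        raise ValueError("r must be in (0, 1]")
    return sigmoid(raw_margin + math.log(r))


def class_weight(label, r, majority_label=0):
    """Option 2: per-row weight to put in the weight column.

    Assumption: each kept majority row stands in for 1/r original rows,
    while minority rows keep weight 1.
    """
    return 1.0 / r if label == majority_label else 1.0


if __name__ == "__main__":
    # A model that says "50/50" on data where only 10% of negatives were kept
    # corresponds to odds of 1:10 on the original data.
    print(corrected_probability(0.0, 0.1))
    print(class_weight(0, 0.1), class_weight(1, 0.1))
```

In Spark you would apply the first function to the margin extracted from rawPrediction (for example in a UDF), or fill the weight column with the second before calling fit(). You normally pick one of the two, not both.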