I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dea
@dbakr Did you get an answer for your biased predictions on your imbalanced dataset?
Though I'm not sure it was your original plan, note that if you first subsample the majority class of your dataset by a ratio r, then, in order to get unbiased predictions from Spark's logistic regression, you can either:
- use the rawPrediction provided by the transform() function and adjust the intercept with log(r)
- or train your regression with weights using .setWeightCol("classWeightCol") (see the article cited here to figure out the value that must be set in the weights).
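The first option can be sketched in plain Scala. This is only a minimal sketch under two assumptions the post does not state: the majority class is the negative class (label 0), and r is the fraction of majority rows kept when subsampling (0 < r <= 1). Under those assumptions the model trained on the subsampled data overstates the odds of the positive class by a factor of 1/r, so adding log(r) to the margin (equivalently, to the intercept) undoes it. The object and function names are illustrative and not part of Spark.

```scala
object InterceptCorrection {
  // Standard logistic function
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // margin: log-odds of the positive class from the model trained on subsampled data
  // r: assumed fraction of majority (negative) rows kept when subsampling
  def correctedProbability(margin: Double, r: Double): Double = {
    require(r > 0.0 && r <= 1.0, "r must be in (0, 1]")
    // log(r) <= 0, so the positive-class log-odds are shifted down
    sigmoid(margin + math.log(r))
  }
}
```

For example, a margin of 0 (probability 0.5 on the subsampled data) with r = 0.25 gives a corrected probability of about 0.2. As far as I know, for Spark's binary logistic regression the margin is the second element of the rawPrediction vector, so the same shift could be applied to that column. If the majority class is the positive one, the sign of the correction flips.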
I used the solution by @Serendipity, but we can optimize the balanceDataset function to avoid using a udf. I also added the ability to change the label column being used. This is the version of the function I ended up with:
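    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{count, sum, when}

    def balanceDataset(dataset: DataFrame, label: String = "label"): DataFrame = {
      // Brings the tuple encoder needed by .as[(Long, Double)] into scope
      import dataset.sparkSession.implicits._

      // Re-balancing (weighting) of records to be used in the logistic loss objective function.
      // One aggregation returns the row count and the number of positives (labels are 0.0 / 1.0).
      val (datasetSize, positives) =
        dataset.select(count("*"), sum(dataset(label))).as[(Long, Double)].collect.head
      val balancingRatio = positives / datasetSize

      // Negatives are weighted by the positive fraction, positives by the negative fraction
      dataset.withColumn("classWeightCol",
        when(dataset(label) === 0.0, balancingRatio).otherwise(1.0 - balancingRatio))
    }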
We create the classifier as he stated with:
    new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")
As of this writing, class weighting for the Random Forest algorithm is still under development (see here).
But if you're willing to try other classifiers, this functionality has already been added to Logistic Regression.
Consider a case where we have 80% positives (label == 1) in the dataset, so theoretically we want to "under-sample" the positive class. The logistic loss objective function should treat the negative class (label == 0) with higher weight.
Here is an example in Scala of generating this weight. We add a new column to the dataframe that holds the weight of each record in the dataset:
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.udf

    def balanceDataset(dataset: DataFrame): DataFrame = {
      // Re-balancing (weighting) of records to be used in the logistic loss objective function
      val numNegatives = dataset.filter(dataset("label") === 0).count
      val datasetSize = dataset.count
      // Fraction of positives in the dataset
      val balancingRatio = (datasetSize - numNegatives).toDouble / datasetSize

      // Negatives get the positive fraction as weight, positives get the negative fraction
      val calculateWeights = udf { d: Double =>
        if (d == 0.0) balancingRatio else 1.0 - balancingRatio
      }

      dataset.withColumn("classWeightCol", calculateWeights(dataset("label")))
    }
Then, we create a classifier as follows:
    new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")
For more details, see: https://issues.apache.org/jira/browse/SPARK-9610
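To see why this weighting balances the two classes, here is a small plain-Scala check that mirrors the arithmetic of balanceDataset without needing Spark. It is only a sketch: the object name, the function names and the row counts used below are made up for illustration.

```scala
object ClassWeights {
  // Returns (weight for negatives, weight for positives), as in balanceDataset
  def weights(numPositives: Long, datasetSize: Long): (Double, Double) = {
    val balancingRatio = numPositives.toDouble / datasetSize
    (balancingRatio, 1.0 - balancingRatio)
  }

  // Returns the total weight carried by (negatives, positives) in the loss
  def weightedTotals(numPositives: Long, datasetSize: Long): (Double, Double) = {
    val (negWeight, posWeight) = weights(numPositives, datasetSize)
    val numNegatives = datasetSize - numPositives
    (numNegatives * negWeight, numPositives * posWeight)
  }
}
```

With a hypothetical dataset of 1000 rows and 800 positives, ClassWeights.weights(800L, 1000L) gives roughly 0.8 for negatives and 0.2 for positives, and ClassWeights.weightedTotals(800L, 1000L) shows both classes carrying a total weight of about 160 (up to floating-point rounding). In other words, the 200 negatives count as much in the loss as the 800 positives.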
A different issue you should check is whether your features have any "predictive power" for the label you're trying to predict. If you still have low precision after under-sampling, the cause may have nothing to do with the fact that your dataset is imbalanced by nature.
I would do an exploratory data analysis. If the classifier doesn't do better than a random choice, there is a risk that there simply is no connection between the features and the class.
Overfitting - a low error on your training set and a high error on your test set might indicate that you are overfitting with an overly flexible feature set.
Bias variance - check whether your classifier suffers from a high bias or a high variance problem.