I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dea
I used the solution by @Serendipity, but the balanceDataset function can be optimized to avoid using a UDF. I also added the ability to change the label column being used. This is the version of the function I ended up with:
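import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, sum, when}

def balanceDataset(dataset: DataFrame, label: String = "label"): DataFrame = {
  // Bring the tuple encoder needed by .as[(Long, Double)] into scope
  val spark = dataset.sparkSession
  import spark.implicits._
  // Re-balancing (weighting) of records to be used in the logistic loss objective function
  val (datasetSize, positives) = dataset.select(count("*"), sum(dataset(label))).as[(Long, Double)].collect.head
  // Fraction of positive records; this assumes the label column holds 0.0 or 1.0
  val balancingRatio = positives / datasetSize
  // Negatives are weighted by the positive fraction, positives by the negative fraction
  dataset.withColumn("classWeightCol", when(dataset(label) === 0.0, balancingRatio).otherwise(1.0 - balancingRatio))
}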
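To see why these weights balance the classes, here is a minimal plain-Scala sketch of the same arithmetic, with no Spark involved. The object name ClassWeightSketch, the helper classWeight, and the 50/950 label split are made up for illustration and are not part of the function above.

```scala
// Plain-Scala sketch of the weighting rule (hypothetical names, made-up counts).
object ClassWeightSketch {
  // Same rule as in balanceDataset: a negative record gets the positive
  // fraction as its weight, a positive record gets the negative fraction.
  def classWeight(label: Double, balancingRatio: Double): Double =
    if (label == 0.0) balancingRatio else 1.0 - balancingRatio

  def main(args: Array[String]): Unit = {
    // Illustrative dataset: 50 positives and 950 negatives.
    val labels: Seq[Double] = Seq.fill(50)(1.0) ++ Seq.fill(950)(0.0)

    // Fraction of positives: sum of 0/1 labels divided by the row count.
    val balancingRatio = labels.sum / labels.size

    val positiveWeight = classWeight(1.0, balancingRatio)
    val negativeWeight = classWeight(0.0, balancingRatio)

    // Total weight carried by each class; the two totals should match
    // up to floating-point rounding.
    val totalPositive = labels.count(_ == 1.0) * positiveWeight
    val totalNegative = labels.count(_ == 0.0) * negativeWeight

    println(s"balancingRatio = $balancingRatio")
    println(s"per-record weights: positive = $positiveWeight, negative = $negativeWeight")
    println(s"total weights: positive = $totalPositive, negative = $totalNegative")
  }
}
```

With P positives out of N records, each class ends up with a total weight of P(N - P)/N, so the rare class contributes as much to the weighted loss as the common one.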
We then create the classifier as described in that answer (if you changed the label column, pass the same name to setLabelCol):
new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")