Dealing with unbalanced datasets in Spark MLlib


孤街浪徒 2020-12-12 13:28

I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dea

3 Answers

南方客 (OP)

2020-12-12 14:16

I used the solution by @Serendipity, but we can optimize the balanceDataset function to avoid using a udf. I also added the ability to change the label column being used. This is the version of the function I ended up with:

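import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, sum, when}

def balanceDataset(dataset: DataFrame, label: String = "label"): DataFrame = {
  // Brings the encoder for the (Long, Double) tuple used by .as[...] into scope
  import dataset.sparkSession.implicits._

  // Re-balancing (weighting) of records to be used in the logistic loss objective function
  val (datasetSize, positives) = dataset.select(count("*"), sum(dataset(label))).as[(Long, Double)].collect.head
  // Fraction of positive records; this assumes the label column holds 0.0/1.0 values
  val balancingRatio = positives / datasetSize

  // Negatives get the positive fraction as weight, positives get the negative fraction
  dataset.withColumn("classWeightCol", when(dataset(label) === 0.0, balancingRatio).otherwise(1.0 - balancingRatio))
}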

We then create the classifier as he described, with:

new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")
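
To see why these weights balance the two classes, here is a small Spark-free sketch of the same arithmetic. The object name, the helper classWeights, and the counts (100 positives out of 1000 rows) are made up for illustration and are not part of the original answer.

```scala
// Spark-free sketch of the weighting arithmetic used by balanceDataset.
// The counts in main are hypothetical illustration values.
object ClassWeightSketch {
  // Returns (weight for label 0.0, weight for label 1.0),
  // mirroring the when(...).otherwise(...) expression in balanceDataset.
  def classWeights(datasetSize: Long, positives: Double): (Double, Double) = {
    val balancingRatio = positives / datasetSize
    (balancingRatio, 1.0 - balancingRatio)
  }

  def main(args: Array[String]): Unit = {
    val datasetSize = 1000L
    val positives = 100.0
    val negatives = datasetSize - positives

    val (negativeWeight, positiveWeight) = classWeights(datasetSize, positives)

    // Total weight contributed by each class to the weighted loss
    val totalNegativeWeight = negatives * negativeWeight
    val totalPositiveWeight = positives * positiveWeight

    println(f"negative weight = $negativeWeight%.2f, class total = $totalNegativeWeight%.1f")
    println(f"positive weight = $positiveWeight%.2f, class total = $totalPositiveWeight%.1f")
  }
}
```

With these numbers each negative row is weighted 0.1 and each positive row 0.9, so both classes contribute a total weight of about 90 even though negatives outnumber positives nine to one.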
