Dealing with unbalanced datasets in Spark MLlib

孤街浪徒 · 2020-12-12 13:28

I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dealing with unbalanced datasets in Spark MLlib.

3 Answers
  •  南方客 (OP) · 2020-12-12 14:16

    I used the solution by @Serendipity, but we can optimize the balanceDataset function to avoid using a UDF. I also added the option to change which label column is used. This is the version of the function I ended up with:

    // Imports needed for this snippet; `spark` is the active SparkSession,
    // whose implicits provide the (Long, Double) encoder used by .as[...]
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{count, sum, when}
    import spark.implicits._

    def balanceDataset(dataset: DataFrame, label: String = "label"): DataFrame = {
      // Re-balancing (weighting) of records to be used in the logistic loss
      // objective function: negatives get the positive-class fraction as their
      // weight, positives get the complement.
      val (datasetSize, positives) =
        dataset.select(count("*"), sum(dataset(label))).as[(Long, Double)].collect.head
      val balancingRatio = positives / datasetSize

      dataset.withColumn(
        "classWeightCol",
        when(dataset(label) === 0.0, balancingRatio).otherwise(1.0 - balancingRatio))
    }
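
    With this scheme the two classes contribute roughly equally to the loss: if, say, only 1% of the records are positive, then balancingRatio ≈ 0.01, so each negative row gets a weight of about 0.01 while each positive row gets a weight of about 0.99.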
    

    We create the classifier as he stated, with:

    new LogisticRegression().setWeightCol("classWeightCol").setLabelCol("label").setFeaturesCol("features")
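
    Putting it together, here is a minimal end-to-end sketch. The DataFrame name `training` is an assumption (any DataFrame with "label" and "features" columns works), and `balanceDataset` is the function above:

    import org.apache.spark.ml.classification.LogisticRegression

    // `training` is a hypothetical DataFrame with "label" and "features" columns
    val weightedTraining = balanceDataset(training, "label")

    val lr = new LogisticRegression()
      .setWeightCol("classWeightCol")
      .setLabelCol("label")
      .setFeaturesCol("features")

    // The weights in classWeightCol are picked up during fitting
    val model = lr.fit(weightedTraining)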
    
