Spark, ML, StringIndexer: handling unseen labels

夕颜 2020-12-08 05:12

My goal is to build a multiclass classifier.

I have built a pipeline for feature extraction and it includes as a first step a StringIndexer transformer to map each c

5 answers
  •  小蘑菇
    小蘑菇 (OP)
    2020-12-08 05:57

    There's a way around this in Spark 1.6.

    Here's the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-8764

    Here's an example:

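    import org.apache.spark.ml.feature.StringIndexer

    // Note: without a call to .fit(...) this is still an Estimator, not a fitted Model.
    val categoryIndexerModel = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("indexedCategory")
      .setHandleInvalid("skip") // new in 1.6; values are "error" (the default) or "skip"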
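
    To make the effect of "skip" concrete, here is a minimal sketch rather than a definitive implementation. It assumes Spark 2.x or later (for SparkSession) running in local mode; the object name, the method name, and the tiny train/test data are hypothetical. The indexer is fitted on training data that never contains "c", so the test row with "c" is dropped instead of raising an error.

    ```scala
    import org.apache.spark.ml.feature.StringIndexer
    import org.apache.spark.sql.SparkSession

    object SkipUnseenLabels {
      // Returns the categories that survive indexing when unseen labels are skipped.
      def keptCategories(spark: SparkSession): Set[String] = {
        import spark.implicits._

        // Hypothetical data: "c" appears only in the test set.
        val trainDF = Seq("a", "b", "a").toDF("category")
        val testDF = Seq("a", "c", "b").toDF("category")

        val indexerModel = new StringIndexer()
          .setInputCol("category")
          .setOutputCol("indexedCategory")
          .setHandleInvalid("skip") // drop rows whose label was not seen during fit
          .fit(trainDF)

        // The row with "c" is filtered out; "a" and "b" get an index.
        val indexed = indexerModel.transform(testDF)
        indexed.select("category").as[String].collect().toSet
      }

      def main(args: Array[String]): Unit = {
        // Assumption: local mode is enough for this illustration.
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("skip-unseen-labels")
          .getOrCreate()
        println(keptCategories(spark))
        spark.stop()
      }
    }
    ```

    The trade-off is that skipped rows silently disappear from the output, so no prediction is produced for them.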
    

    I started using this, but ended up going back to KrisP's 2nd bullet point about fitting this particular Estimator to the full dataset.

    You'll need the fitted model later in the pipeline, when you convert the predicted indices back to strings with IndexToString.

    Here's the modified example:

    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

    val categoryIndexerModel = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("indexedCategory")
      .fit(itemsDF) // Fit the Estimator on the full dataset and create a Model (Transformer)

    // ... do some kind of classification ...

    // Map the predicted indices back to the original category strings
    val categoryReverseIndexer = new IndexToString()
      .setInputCol(classifier.getPredictionCol)
      .setOutputCol("predictedCategory")
      .setLabels(categoryIndexerModel.labels) // Use the labels from the Model
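
    The round trip above can be sketched end to end as follows. This is only an illustration under stated assumptions: Spark 2.x or later (for SparkSession and Dataset.union), hypothetical toy data, and hypothetical object and method names. There is no classifier here, so the indexed column stands in for the classifier's prediction column. Because the indexer is fitted on the union of training and test data, every label has an index and IndexToString can recover it.

    ```scala
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
    import org.apache.spark.sql.SparkSession

    object FitOnFullDataset {
      // Returns (original category, recovered category) pairs for the test rows.
      def roundTrip(spark: SparkSession): Seq[(String, String)] = {
        import spark.implicits._

        // Hypothetical data: "c" appears only in the test set.
        val trainDF = Seq("a", "b", "a").toDF("category")
        val testDF = Seq("a", "c", "b").toDF("category")
        val fullDF = trainDF.union(testDF)

        // Fitting on the full dataset means no label is unseen at transform time.
        val indexerModel = new StringIndexer()
          .setInputCol("category")
          .setOutputCol("indexedCategory")
          .fit(fullDF)

        // Stand-in for the prediction column: we reverse the indexed column directly.
        val reverseIndexer = new IndexToString()
          .setInputCol("indexedCategory")
          .setOutputCol("recoveredCategory")
          .setLabels(indexerModel.labels)

        reverseIndexer
          .transform(indexerModel.transform(testDF))
          .select("category", "recoveredCategory")
          .as[(String, String)]
          .collect()
          .toSeq
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("fit-on-full-dataset")
          .getOrCreate()
        roundTrip(spark).foreach(println)
        spark.stop()
      }
    }
    ```

    Fitting on the full dataset only looks at the label values, but it does mean the test labels must be available when the pipeline is fitted.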
    
