Spark, ML, StringIndexer: handling unseen labels

夕颜 2020-12-08 05:12

My goal is to build a multiclass classifier.

I have built a pipeline for feature extraction and it includes as a first step a StringIndexer transformer to map each c

5 answers

小蘑菇 (original poster)

2020-12-08 05:57

There's a way around this in Spark 1.6.

Here's the JIRA issue: https://issues.apache.org/jira/browse/SPARK-8764

Here's an example:

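import org.apache.spark.ml.feature.StringIndexer

// This is still an unfitted Estimator; call .fit(...) on it to get the StringIndexerModel.
val categoryIndexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("indexedCategory")
  .setHandleInvalid("skip") // new in 1.6: "error" (default) or "skip"; later releases also accept "keep"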
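To make the effect of "skip" concrete, here is a minimal sketch rather than a definitive recipe. It assumes Spark 2.x or later (it uses SparkSession, which Spark 1.6 does not have), and the object name SkipUnseenLabels, the helper indexSkippingUnseen, and the toy data are all made up for illustration. The indexer is fitted on data that contains only "a" and "b", so a row labelled "c" should be dropped at transform time instead of raising an error.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

object SkipUnseenLabels {
  // Hypothetical helper: fit the indexer on trainDF only, then index testDF.
  // With handleInvalid = "skip", rows whose label was not seen during fit are filtered out.
  def indexSkippingUnseen(trainDF: DataFrame, testDF: DataFrame): DataFrame = {
    val indexerModel = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("indexedCategory")
      .setHandleInvalid("skip")
      .fit(trainDF)
    indexerModel.transform(testDF)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption: Spark 2.x+).
    val spark = SparkSession.builder().master("local[1]").appName("skip-unseen").getOrCreate()
    import spark.implicits._

    val trainDF = Seq("a", "b", "a").toDF("category") // labels seen at fit time
    val testDF = Seq("a", "c").toDF("category")       // "c" is unseen

    indexSkippingUnseen(trainDF, testDF).show()       // the "c" row should be missing
    spark.stop()
  }
}
```

Note that skipping silently shrinks the dataset, which is one reason to prefer fitting the indexer on the full dataset when every row needs a prediction.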

I started using this, but ended up going back to KrisP's 2nd bullet point about fitting this particular Estimator to the full dataset.

You'll need the fitted model later in the pipeline, when IndexToString converts the predicted indices back to the original labels.

Here's the modified example:

import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val categoryIndexerModel = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("indexedCategory")
  .fit(itemsDF) // itemsDF is the full dataset; fit returns a StringIndexerModel (a Transformer)

... do some kind of classification ...

// classifier stands for whichever classifier was configured in the elided step above
val categoryReverseIndexer = new IndexToString()
  .setInputCol(classifier.getPredictionCol)
  .setOutputCol("predictedCategory")
  .setLabels(categoryIndexerModel.labels) // Use the labels from the Model
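For anyone who wants to see the pieces wired together, the following is one possible end-to-end sketch, not the setup I actually used. It assumes Spark 2.x or later; the DecisionTreeClassifier, the single numeric feature column "x", the VectorAssembler step, and the names FullDatasetIndexing and predictCategories are all stand-ins chosen for illustration. The indexer is fitted on the full dataset, so the classifier and IndexToString both see the complete label set even when the training split is missing a category.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object FullDatasetIndexing {
  // Hypothetical helper: itemsDF is the full dataset, trainDF and testDF are splits of it.
  // All three are assumed to have a string column "category" and a numeric column "x".
  def predictCategories(itemsDF: DataFrame, trainDF: DataFrame, testDF: DataFrame): DataFrame = {
    // Fit on the full dataset so that no label is unseen later on.
    val categoryIndexerModel = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("indexedCategory")
      .fit(itemsDF)

    // Stand-in feature step: pack the single numeric column into a vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x"))
      .setOutputCol("features")

    // Stand-in for "some kind of classification".
    val classifier = new DecisionTreeClassifier()
      .setLabelCol("indexedCategory")
      .setFeaturesCol("features")

    // Map predicted indices back to the original strings using the model's labels.
    val categoryReverseIndexer = new IndexToString()
      .setInputCol(classifier.getPredictionCol)
      .setOutputCol("predictedCategory")
      .setLabels(categoryIndexerModel.labels)

    val pipeline = new Pipeline()
      .setStages(Array(categoryIndexerModel, assembler, classifier, categoryReverseIndexer))

    pipeline.fit(trainDF).transform(testDF)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("full-fit").getOrCreate()
    import spark.implicits._

    val itemsDF = Seq(("a", 0.0), ("a", 0.1), ("b", 1.0), ("b", 1.1), ("c", 2.0))
      .toDF("category", "x")
    val trainDF = itemsDF.filter($"category" =!= "c") // training split never sees "c"

    predictCategories(itemsDF, trainDF, itemsDF)
      .select("category", "x", "predictedCategory")
      .show()
    spark.stop()
  }
}
```

The already-fitted StringIndexerModel is passed to the Pipeline as a plain Transformer, so Pipeline.fit does not refit it on the training split.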
