My goal is to build a multicalss classifier.
I have built a pipeline for feature extraction and it includes as a first step a StringIndexer transformer to map each c
There's a way around this in Spark 1.6.
Here's the jira: https://issues.apache.org/jira/browse/SPARK-8764
Here's an example:
val categoryIndexerModel = new StringIndexer()
.setInputCol("category")
.setOutputCol("indexedCategory")
.setHandleInvalid("skip") // new method. values are "error" or "skip"
I started using this, but ended up going back to KrisP's 2nd bullet point about fitting this particular Estimator to the full dataset.
You'll need this later in the pipeline when you convert the IndexToString.
Here's the modified example:
val categoryIndexerModel = new StringIndexer()
.setInputCol("category")
.setOutputCol("indexedCategory")
.fit(itemsDF) // Fit the Estimator and create a Model (Transformer)
... do some kind of classification ...
val categoryReverseIndexer = new IndexToString()
.setInputCol(classifier.getPredictionCol)
.setOutputCol("predictedCategory")
.setLabels(categoryIndexerModel.labels) // Use the labels from the Model