My goal is to build a multiclass classifier.
I have built a pipeline for feature extraction and it includes as a first step a StringIndexer transformer to map each c
To me, ignoring the rows completely by setting an argument (https://issues.apache.org/jira/browse/SPARK-8764) is not really a feasible way to solve the issue.
I ended up creating my own CustomStringIndexer transformer, which assigns a new value to every string that was not encountered during training. You can also do this by changing the relevant portion of the Spark feature code (just remove the if condition that explicitly checks for this and make it return the length of the array instead) and recompiling the jar.
Not really an easy fix, but it certainly is a fix.
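To make the idea concrete, here is a minimal plain-Python sketch of the mapping logic, not the actual CustomStringIndexer and not Spark code. The class name `UnseenAwareIndexer` is hypothetical, and the alphabetical tie-breaking is an assumption on my part. It orders labels by frequency, as StringIndexer does by default, and sends every unseen string to one extra index equal to the number of known labels.

```python
from collections import Counter


class UnseenAwareIndexer:
    """Hypothetical stand-in for the custom indexer; not a Spark class."""

    def fit(self, values):
        counts = Counter(values)
        # Most frequent label gets index 0; ties are broken alphabetically
        # (the tie-breaking rule is an assumption made for determinism).
        self.labels = sorted(counts, key=lambda s: (-counts[s], s))
        self.index = {label: i for i, label in enumerate(self.labels)}
        return self

    def transform(self, values):
        # Every string not seen during fit shares one extra index,
        # equal to the length of the labels list.
        unseen = len(self.labels)
        return [float(self.index.get(v, unseen)) for v in values]


indexer = UnseenAwareIndexer().fit(["a", "b", "a", "c", "a", "b"])
print(indexer.transform(["a", "b", "c", "d"]))  # [0.0, 1.0, 2.0, 3.0]
```

Since all unseen strings collapse into a single bucket, the downstream classifier sees them as one extra category rather than failing or silently dropping rows.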
I remember seeing a JIRA issue to incorporate this as well: https://issues.apache.org/jira/browse/SPARK-17498
It is set to be released with Spark 2.2, though. Just have to wait, I guess :S