Spark, ML, StringIndexer: handling unseen labels

夕颜 2020-12-08 05:12

My goal is to build a multiclass classifier.

I have built a pipeline for feature extraction and it includes as a first step a StringIndexer transformer to map each c

5 answers
  •  佛祖请我去吃肉
    2020-12-08 05:55

    To me, ignoring the rows completely by setting an argument (https://issues.apache.org/jira/browse/SPARK-8764) is not really a feasible way to solve the issue.

    I ended up creating my own CustomStringIndexer transformer, which assigns a new value to every string that was not encountered during training. You can also do this by changing the relevant portion of the Spark feature code (remove the if condition that explicitly checks for unseen labels and make it return the length of the labels array instead) and recompiling the jar.
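
To illustrate the idea without Spark, here is a minimal pure-Python sketch of an indexer that maps every unseen string to one extra index equal to the number of known labels. The class name UnseenTolerantIndexer and its fit/transform methods are hypothetical and only mimic the behaviour described above; this is not the actual CustomStringIndexer code and not a Spark API. The ordering rule (most frequent label first, ties broken alphabetically) is an assumption made to keep the output deterministic.

```python
from collections import Counter
from typing import Dict, Iterable, List


class UnseenTolerantIndexer:
    """Hypothetical stand-in for the custom indexer described above (not a Spark API)."""

    def fit(self, values: Iterable[str]) -> "UnseenTolerantIndexer":
        counts = Counter(values)
        # Assumption: most frequent label gets index 0, ties broken alphabetically.
        self.labels: List[str] = sorted(counts, key=lambda v: (-counts[v], v))
        self.index: Dict[str, int] = {label: i for i, label in enumerate(self.labels)}
        return self

    def transform(self, values: Iterable[str]) -> List[float]:
        # Every label not seen during fit() shares one extra index: len(labels).
        unseen = float(len(self.labels))
        return [float(self.index[v]) if v in self.index else unseen for v in values]


indexer = UnseenTolerantIndexer().fit(["a", "b", "a", "c", "a", "b"])
indexed = indexer.transform(["a", "b", "c", "d"])
print(indexed)  # [0.0, 1.0, 2.0, 3.0]
```

Here "d" never appeared during fitting, so it receives index 3.0, the length of the labels list, instead of raising an error. A downstream model then needs to expect one more category than the number of labels seen in training.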

    Not really an easy fix, but it certainly is a fix.

    I remember seeing a bug in JIRA to incorporate this as well: https://issues.apache.org/jira/browse/SPARK-17498

    It is set to be released with Spark 2.2 though. Just have to wait I guess :S
