spark.ml StringIndexer throws 'Unseen label' on fit()

允我心安 提交于 2019-11-26 19:06:50
zero323

Unseen label is a generic message which doesn't correspond to a specific column. Most likely problem is with a following stage:

StringIndexer(inputCol='lang', outputCol='lang_idx')

with pl-PL present in train("lang") and not present in test("lang").

You can correct it using setHandleInvalid with skip:

from pyspark.ml.feature import StringIndexer

train = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["k", "v"])
test = sc.parallelize([(3, "foo"), (4, "foobar")]).toDF(["k", "v"])

indexer = StringIndexer(inputCol="v", outputCol="vi")
indexer.fit(train).transform(test).show()

## Py4JJavaError: An error occurred while calling o112.showString.
## : org.apache.spark.SparkException: Job aborted due to stage failure: 
##   ...
##   org.apache.spark.SparkException: Unseen label: foobar.

indexer.setHandleInvalid("skip").fit(train).transform(test).show()

## +---+---+---+
## |  k|  v| vi|
## +---+---+---+
## |  3|foo|1.0|
## +---+---+---+

or, in the latest versions, keep:

indexer.setHandleInvalid("keep").fit(train).transform(test).show()

## +---+------+---+
## |  k|     v| vi|
## +---+------+---+
## |  3|   foo|0.0|
## |  4|foobar|2.0|
## +---+------+---+

Okay I think I got this. At least I got this working.

Caching the dataframe(including train/test partes) solves the problem. That's what I found in this JIRA issue: https://issues.apache.org/jira/browse/SPARK-12590.

So it's not a bug, just the fact that randomSample might yield a different result on the same, but differently partitioned dataset. And apparently, some of my munging functions (or Pipeline) involve repartition, therefore, results of the trainset recomputation from its definition might diverge.

What still interests me - it's the reproducibility: it's always 'pl-PL' row that gets mixed in the wrong part of the dataset, i.e. it's not random repartition. It's deterministic, just inconsistent. I wonder how exactly it happens.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!