Error when passing data from a Dataframe into an existing ML VectorIndexerModel

Posted by 一曲冷凌霜 on 2019-11-28 14:15:26
zero323

This happens because the PipelineModel includes a VectorIndexerModel, and the features contain unseen levels (values that were not present when the model was fitted) in one of the columns marked as categorical. You can reproduce the same error as follows:

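import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._  // for toDF; assumes a SparkSession named spark, as in spark-shell

// The only level seen during fit is 0.0; the test row contains the unseen level 1.0.
val train = Seq((1L, Vectors.dense(0.0))).toDF("id", "foo")
val test = Seq((1L, Vectors.dense(1.0))).toDF("id", "foo")

new VectorIndexer().setInputCol("foo").setOutputCol("bar")
  .fit(train).transform(test).first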

As of Spark 2.2, VectorIndexer doesn't support handling unseen levels (as StringIndexer does), but this functionality is planned for a future release.

Edit:

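In Spark 2.3 and later, VectorIndexer has a handleInvalid parameter. With "keep", unseen levels are put into an additional bucket instead of raising an error. For example: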

new VectorIndexer()
  .setInputCol("foo").setOutputCol("bar")
  .setHandleInvalid("keep")
  .fit(train).transform(test).first
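
To make the behaviour more concrete, here is a minimal Spark-free sketch of what happens to an unseen level under each handleInvalid setting ("error", "skip", "keep"). It is only an illustration of the idea, not Spark's actual implementation: the object UnseenLevelsSketch, the functions fit and indexValue, and the choice of exception types are all hypothetical names and assumptions made for this example.

```scala
// Illustrative model only; none of these names belong to Spark's API.
object UnseenLevelsSketch {

  // "Fit": learn a category map (raw value -> index) from the training values.
  def fit(values: Seq[Double]): Map[Double, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  // "Transform" a single value under a given policy for unseen levels.
  // Returns None when the row would be dropped ("skip").
  def indexValue(
      categories: Map[Double, Int],
      value: Double,
      handleInvalid: String): Option[Int] =
    categories.get(value) match {
      case Some(idx) => Some(idx) // level was seen during fit
      case None =>
        handleInvalid match {
          // Assumption: exception type chosen for the sketch only.
          case "error" => throw new NoSuchElementException(s"unseen level: $value")
          case "skip"  => None
          // Extra bucket placed right after the known indices.
          case "keep"  => Some(categories.size)
          case other   => throw new IllegalArgumentException(s"unknown policy: $other")
        }
    }
}
```

With the data from the example above, fit(Seq(0.0)) learns a single level, so the unseen value 1.0 fails under "error", is dropped under "skip", and lands in bucket 1 under "keep".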