SparkException: Values to assemble cannot be null

女生的网名这么多〃 提交于 2019-11-29 02:07:48

Spark >= 2.4

Since Spark 2.4 VectorAssembler extends HasHandleInvalid. It means you can skip:

assembler.setHandleInvalid("skip").transform(df).show
+---+---+---------+
| x1| x2| features|
+---+---+---------+
|3.0|4.0|[3.0,4.0]|
+---+---+---------+

keep (note that ML algorithms are unlikely to handle this correctly):

assembler.setHandleInvalid("keep").transform(df).show
+----+----+---------+
|  x1|  x2| features|
+----+----+---------+
| 1.0|null|[1.0,NaN]|
|null| 2.0|[NaN,2.0]|
| 3.0| 4.0|[3.0,4.0]|
+----+----+---------+

or default to error.

Spark < 2.4

There is nothing wrong with VectorAssembler. Spark Vector just cannot contain null values.

import org.apache.spark.ml.feature.VectorAssembler

val df = Seq(
  (Some(1.0), None), (None, Some(2.0)), (Some(3.0), Some(4.0))
).toDF("x1", "x2")

val assembler = new VectorAssembler()
  .setInputCols(df.columns).setOutputCol("features")

assembler.transform(df).show(3)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$3: (struct<x1:double,x2:double>) => vector)
...
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.

Null are not meaningful for ML algorithms and cannot be represented using scala.Double.

You have to either drop:

assembler.transform(df.na.drop).show(2)
+---+---+---------+
| x1| x2| features|
+---+---+---------+
|3.0|4.0|[3.0,4.0]|
+---+---+---------+

or fill / impute (see also Replace missing values with mean - Spark Dataframe):

// For example with averages
val replacements: Map[String,Any] = Map("x1" -> 2.0, "x2" -> 3.0)
assembler.transform(df.na.fill(replacements)).show(3)
+---+---+---------+
| x1| x2| features|
+---+---+---------+
|1.0|3.0|[1.0,3.0]|
|2.0|2.0|[2.0,2.0]|
|3.0|4.0|[3.0,4.0]|
+---+---+---------+

nulls.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!