Question
I have a Parquet file containing id and features columns, and I want to apply the PCA algorithm.
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val pca = new PCA()
.setInputCol("features")
.setK(50)
.fit(dataset)
.setOutputCol("pcaFeatures")
val result = pca.transform(dataset).select("pcaFeatures")
pca.save("/usr/local/spark/dataset/out")
but I get this exception:
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).
Answer 1:
Spark's PCA transformer needs a vector column (of type VectorUDT), which is typically produced by a VectorAssembler. Here you create an assembler but never use it. Also, the VectorAssembler only accepts numeric columns as input; according to your exception, features is ArrayType(DoubleType), which it cannot handle, so transform it into numeric columns (or directly into a vector) first. Finally, it is a bad idea to give the assembled column the same name as an original column: the VectorAssembler does not remove its input columns, so you would end up with two features columns.
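If your features column really is an array<double>, as the exception suggests, one option is to convert it into an ML vector before (or instead of) assembling. A minimal sketch, assuming the path and column names from your question (array_to_vector requires Spark 3.1+; the UDF variant works on older versions):

import org.apache.spark.ml.functions.array_to_vector // Spark 3.1+
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")

// Spark 3.1+: built-in conversion from array<double> to an ML Vector
val withVector = dataset.withColumn("featuresVec", array_to_vector(col("features")))

// Older versions: the same conversion with a UDF
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val withVectorUdf = dataset.withColumn("featuresVec", toVector(col("features")))

PCA can then use featuresVec as its input column directly, without clashing with the original features column.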
Here is a working example of PCA computation in Spark:
import org.apache.spark.ml.feature._
import spark.implicits._ // needed for the 'col symbol syntax outside spark-shell

// a small, purely numeric dataset
val df = spark.range(10)
  .select('id, ('id * 'id) as "id2", ('id * 'id * 'id) as "id3")

// assemble the numeric columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("id", "id2", "id3")).setOutputCol("features")
val assembled_df = assembler.transform(df)

// fit the PCA on the assembled vector column and apply it
val pca = new PCA()
  .setInputCol("features").setOutputCol("pcaFeatures").setK(2)
  .fit(assembled_df)
val result = pca.transform(assembled_df)
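Since your snippet also calls pca.save, note that the fitted PCAModel (the result of .fit) can be persisted and reloaded in the same way. A short sketch, reusing the output path from your question:

import org.apache.spark.ml.feature.PCAModel

// persist the fitted model, not the transformed data
pca.write.overwrite().save("/usr/local/spark/dataset/out")

// reload it later and apply it to new data with the same schema
val reloaded = PCAModel.load("/usr/local/spark/dataset/out")
val scored = reloaded.transform(assembled_df).select("pcaFeatures")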
Source: https://stackoverflow.com/questions/59922124/illegalargumentexception-when-computing-a-pca-with-spark-ml