IllegalArgumentException when computing a PCA with Spark ML

試著忘記壹切 submitted on 2021-02-10 05:08:42

Question


I have a Parquet file containing id and features columns, and I want to apply the PCA algorithm to it:

val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
    .setInputCols(Array("id", "features"))
    .setOutputCol("features")
val pca = new PCA()
    .setInputCol("features")
    .setK(50)
    .fit(dataset)
    .setOutputCol("pcaFeatures")
val result = pca.transform(dataset).select("pcaFeatures")
pca.save("/usr/local/spark/dataset/out")

but I get this exception:

java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).


Answer 1:


Spark's PCA transformer needs a column created by a VectorAssembler. Here you create one but never use it. Also, the VectorAssembler only takes numeric columns as input: I don't know exactly what your features column holds, but the exception shows it is an array, which won't work, so transform it into numeric columns (or a Vector) first. Finally, it is a bad idea to give the assembled column the same name as an original column: the VectorAssembler does not remove its input columns, so you would end up with two features columns.
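Since the exception says your features column is an ArrayType(DoubleType), one option is to convert the array directly into an ML Vector instead of going through the assembler. A minimal sketch, assuming the column really is an array<double> (the featuresVec name is just an illustration):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Turn an array<double> column into the ml Vector type that PCA expects
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val withVectors = dataset.withColumn("featuresVec", toVector(col("features")))

On Spark 3.1+, org.apache.spark.ml.functions.array_to_vector does the same conversion without a custom UDF.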

Here is a working example of PCA computation in Spark:

import org.apache.spark.ml.feature._
import spark.implicits._  // needed for the 'colName symbol syntax below

// Build a small numeric DataFrame to assemble into a feature vector
val df = spark.range(10)
    .select('id, ('id * 'id) as "id2", ('id * 'id * 'id) as "id3")

// Assemble the numeric columns into a single Vector column named "features"
val assembler = new VectorAssembler()
    .setInputCols(Array("id", "id2", "id3"))
    .setOutputCol("features")
val assembled_df = assembler.transform(df)

// Fit a PCA model that projects the 3 features down to k = 2 components
val pca = new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(2)
    .fit(assembled_df)
val result = pca.transform(assembled_df)
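You can then inspect the projection and, if you want, persist the fitted model (the output path below is just a placeholder; PCAModel is MLWritable, so write.save works):

result.select("pcaFeatures").show(truncate = false)  // inspect the 2-component projection
pca.write.overwrite().save("/usr/local/spark/dataset/out")  // persist the fitted PCAModel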


Source: https://stackoverflow.com/questions/59922124/illegalargumentexception-when-computing-a-pca-with-spark-ml
