Question
I have a parquet file containing the id and features columns, and I want to apply the PCA algorithm.
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val pca = new PCA()
.setInputCol("features")
.setK(50)
.fit(dataset)
.setOutputCol("pcaFeatures")
val result = pca.transform(dataset).select("pcaFeatures")
pca.save("/usr/local/spark/dataset/out")
but I get this exception:
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).
Answer 1:
Spark's PCA transformer needs a column created by a VectorAssembler. Here you create one but never use it. Also, the VectorAssembler only takes numbers as input; I don't know what the type of features is, but if it's an array, it won't work. Transform it into numeric columns first. Finally, it is a bad idea to name the assembled column the same way as an original column: the VectorAssembler does not remove its input columns, so you would end up with two features columns.
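On the ArrayType(DoubleType) error itself: if features already holds arrays of doubles, one option is to wrap each array in an ML vector before running PCA. A minimal sketch, assuming the dataset from the question; the toVector helper and the featuresVec column name are hypothetical:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: turns an array of doubles into a dense ML vector
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))

// Use a new column name so the original "features" column is not shadowed
val withVector = dataset.withColumn("featuresVec", toVector(col("features")))

On Spark 3.1 and later, org.apache.spark.ml.functions.array_to_vector performs the same conversion without a custom UDF.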
Here is a working example of PCA computation in Spark:
import org.apache.spark.ml.feature._

// Build a small numeric DataFrame with three columns: id, id2 = id^2, id3 = id^3
val df = spark.range(10)
  .select('id, ('id * 'id) as "id2", ('id * 'id * 'id) as "id3")

// Assemble the numeric columns into a single vector column named "features"
val assembler = new VectorAssembler()
  .setInputCols(Array("id", "id2", "id3")).setOutputCol("features")
val assembled_df = assembler.transform(df)

// Fit the PCA model on the assembled vectors and project them onto 2 components
val pca = new PCA()
  .setInputCol("features").setOutputCol("pcaFeatures").setK(2)
  .fit(assembled_df)
val result = pca.transform(assembled_df)
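Applied back to the question's data, the same pattern would look roughly like this; it reuses k = 50 and the output path from the question, and assumes the withVector DataFrame and featuresVec column from the conversion sketch above:

import org.apache.spark.ml.feature.PCA

// Fit PCA on the vector column produced above (column and value names are assumptions)
val pcaModel = new PCA()
  .setInputCol("featuresVec")
  .setOutputCol("pcaFeatures")
  .setK(50)
  .fit(withVector)

// Project the rows and save the fitted model for later reuse
val projected = pcaModel.transform(withVector).select("id", "pcaFeatures")
pcaModel.write.overwrite().save("/usr/local/spark/dataset/out")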
Source: https://stackoverflow.com/questions/59922124/illegalargumentexception-when-computing-a-pca-with-spark-ml