Pyspark KMeans clustering features column IllegalArgumentException

Submitted by 点点圈 on 2019-12-23 04:10:11

Question


pyspark==2.4.0

Here is the code giving the exception:

LDA = spark.read.parquet('./LDA.parquet/')
LDA.printSchema()

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(featuresCol='topic_vector_fix_dim').setK(15).setSeed(1)
model = kmeans.fit(LDA)

The printSchema() output:

root
 |-- Id: string (nullable = true)
 |-- topic_vector_fix_dim: array (nullable = true)
 |    |-- element: double (containsNull = true)

IllegalArgumentException: 'requirement failed: Column topic_vector_fix_dim must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type array<double>.'

I am confused: the message says array<double> is an accepted type, yet it rejects my column as being of type array<double>.
Each entry of topic_vector_fix_dim is a 1-D array of doubles.
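
The mismatch hides in the element nullability flag, which printSchema() shows as containsNull = true but the error message does not print at all. A quick way to inspect that flag on the DataFrame above (a sketch; field lookup by name assumes PySpark 2.x or later):

field = LDA.schema["topic_vector_fix_dim"]
print(field.dataType)               # ArrayType(DoubleType,true)
print(field.dataType.containsNull)  # True -- this is what fails the type check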


Answer 1:


containsNull of the features column's element type must be set to False. Rebuilding the column through an identity UDF that declares the stricter schema does exactly that:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

new_schema = ArrayType(DoubleType(), containsNull=False)
udf_foo = udf(lambda x: x, new_schema)  # identity; only the declared schema changes
LDA = LDA.withColumn("topic_vector_fix_dim", udf_foo("topic_vector_fix_dim"))

After that everything works. The fix works because the type check compares the full ArrayType, including the containsNull flag: array<double> with containsNull=true is not equal to the accepted array<double> with containsNull=false, even though both print the same name, which is why the error message looks self-contradictory.
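
An alternative that sidesteps the array-type check entirely is to convert the column to an ml Vector, which Spark ML estimators always accept as a features column. A minimal sketch, assuming the same LDA DataFrame and no null entries in the arrays:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())  # raises on null rows
LDA = LDA.withColumn("topic_vector_fix_dim", to_vector("topic_vector_fix_dim"))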




Answer 2:


The containsNull answer didn't work for me, but this did:

from pyspark.ml.feature import VectorAssembler

# Assemble separate numeric columns into a single ml Vector column
vectorAssembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
df = vectorAssembler.transform(df)
df = df.select(['features', 'Y'])

Note that this assumes the features live in separate numeric columns (x1, x2, x3) rather than in a single array column; VectorAssembler outputs a VectorUDT column, which passes the type check directly.
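
With the assembled features column in place, the KMeans call from the question runs against it unchanged apart from the column name (same k and seed kept for illustration):

from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol='features').setK(15).setSeed(1)
model = kmeans.fit(df)
centers = model.clusterCenters()  # list of cluster centers as numpy arrays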


Source: https://stackoverflow.com/questions/55162989/pyspark-kmeans-clustering-features-column-illegalargumentexception
