KMeans clustering in PySpark

猫巷女王i · 2020-12-23 08:58

I have a Spark dataframe 'mydataframe' with many columns. I am trying to run k-means on only two of them, lat and long (latitude & longitude), using them as simple numeric values.

2 Answers
  •  误落风尘 · 2020-12-23 09:09

    Despite my other, more general answer, in case you must for whatever reason stick with MLlib & RDDs, here is what causes your error, demonstrated with the same toy dataframe.

    When you select columns from a dataframe and convert the result to an RDD, as you do, you get an RDD of Rows:

    df.select('lat', 'long').rdd.collect()
    # [Row(lat=33.3, long=-17.5), Row(lat=40.4, long=-20.5), Row(lat=28.0, long=-23.9), Row(lat=29.5, long=-19.0), Row(lat=32.8, long=-18.84)]
    

    which is not suitable as input to MLlib KMeans. You'll need a map operation for this to work:

    df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1])).collect()
    # [(33.3, -17.5), (40.4, -20.5), (28.0, -23.9), (29.5, -19.0), (32.8, -18.84)]
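    As a side note, the same per-record conversion can be written with `operator.itemgetter` instead of a lambda; it works on any indexable record. A plain-Python sketch of the equivalence (the tuple below is a stand-in for a Row, with a hypothetical extra column):

```python
from operator import itemgetter

# itemgetter(0, 1) builds a callable equivalent to lambda x: (x[0], x[1])
get_lat_long = itemgetter(0, 1)

# stand-in for a Row; the third field is a hypothetical extra column
row = (33.3, -17.5, "extra-column")

get_lat_long(row)
# (33.3, -17.5)
```

    In PySpark this would read `df.select('lat', 'long').rdd.map(itemgetter(0, 1))`, with identical results.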
    

    So, your code should be like this:

    from pyspark.mllib.clustering import KMeans, KMeansModel
    
    rdd = df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1]))
    clusters = KMeans.train(rdd, 2, maxIterations=10, initializationMode="random") # works OK
    clusters.centers
    # [array([ 40.4, -20.5]), array([ 30.9 , -19.81])]
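    For intuition, assigning a point to a cluster is just picking the nearest center by Euclidean distance, which is what `clusters.predict` does. A plain-Python sketch of that rule (not the MLlib API itself), using the two centers printed above:

```python
import math

# centers copied from clusters.centers above
centers = [(40.4, -20.5), (30.9, -19.81)]

def predict(point, centers):
    """Return the index of the Euclidean-nearest center."""
    return min(range(len(centers)),
               key=lambda i: math.dist(point, centers[i]))

points = [(33.3, -17.5), (40.4, -20.5), (28.0, -23.9),
          (29.5, -19.0), (32.8, -18.84)]
labels = [predict(p, centers) for p in points]
# labels == [1, 0, 1, 1, 1]
```

    With only two clusters on this toy data, one point sits exactly on the first center and the rest fall closer to the second.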
    
