KMeans clustering in PySpark

猫巷女王i · 2020-12-23 08:58

I have a Spark dataframe 'mydataframe' with many columns. I am trying to run k-means on only two of them, lat and long (latitude & longitude), using them as simple numeric values.

2 Answers
  •  误落风尘 · 2020-12-23 09:09

    Despite my other, more general answer, in case you must for whatever reason stick with MLlib & RDDs, here is what causes your error, demonstrated with the same toy dataframe.

    When you select columns from a dataframe and convert the result to an RDD, as you do, you get an RDD of Rows:

    df.select('lat', 'long').rdd.collect()
    # [Row(lat=33.3, long=-17.5), Row(lat=40.4, long=-20.5), Row(lat=28.0, long=-23.9), Row(lat=29.5, long=-19.0), Row(lat=32.8, long=-18.84)]
    

    which is not suitable as input to MLlib KMeans. You'll need a map operation for this to work:

    df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1])).collect()
    # [(33.3, -17.5), (40.4, -20.5), (28.0, -23.9), (29.5, -19.0), (32.8, -18.84)]
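    As a side note, the same per-record conversion can be written with `operator.itemgetter` instead of a lambda; it works on any indexable record. A plain-Python sketch of the equivalence (the tuple below is a stand-in for a Row, with a hypothetical extra column):

```python
from operator import itemgetter

# itemgetter(0, 1) builds a callable equivalent to lambda x: (x[0], x[1])
get_lat_long = itemgetter(0, 1)

# stand-in for a Row; the third field is a hypothetical extra column
row = (33.3, -17.5, "extra-column")

get_lat_long(row)
# (33.3, -17.5)
```

    In PySpark this would read `df.select('lat', 'long').rdd.map(itemgetter(0, 1))`, with identical results.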
    

    So, your code should be like this:

    from pyspark.mllib.clustering import KMeans, KMeansModel
    
    rdd = df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1]))
    clusters = KMeans.train(rdd, 2, maxIterations=10, initializationMode="random") # works OK
    clusters.centers
    # [array([ 40.4, -20.5]), array([ 30.9 , -19.81])]
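    For intuition, assigning a point to a cluster is just picking the nearest center by Euclidean distance, which is what `clusters.predict` does. A plain-Python sketch of that rule (not the MLlib API itself), using the two centers printed above:

```python
import math

# centers copied from clusters.centers above
centers = [(40.4, -20.5), (30.9, -19.81)]

def predict(point, centers):
    """Return the index of the Euclidean-nearest center."""
    return min(range(len(centers)),
               key=lambda i: math.dist(point, centers[i]))

points = [(33.3, -17.5), (40.4, -20.5), (28.0, -23.9),
          (29.5, -19.0), (32.8, -18.84)]
labels = [predict(p, centers) for p in points]
# labels == [1, 0, 1, 1, 1]
```

    With only two clusters on this toy data, one point sits exactly on the first center and the rest fall closer to the second.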
    
