Spark MLlib KMeans from DataFrame, and back again

Asked by 被撕碎了的回忆 on 2020-12-28 23:52

I aim to apply a k-means clustering algorithm to a very large data set using Spark (1.3.1) MLlib. I have pulled the data from HDFS using a HiveContext in Spark, and would eventually like to put the cluster assignments back into a DataFrame the same way.
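
For reference, a minimal sketch of my setup so far (the table name my_table and the columns id, f1, f2, f3 are placeholders, not my real schema):

    from pyspark.sql import HiveContext
    from pyspark.mllib.linalg import Vectors

    # sc is the existing SparkContext from the shell/session.
    hiveContext = HiveContext(sc)

    # Pull the feature table out of Hive into a DataFrame.
    df = hiveContext.sql("SELECT id, f1, f2, f3 FROM my_table")

    # Key each point by its ID and pack the numeric columns
    # into an MLlib vector.
    pointsWithID = df.rdd.map(
        lambda row: (row.id, Vectors.dense([row.f1, row.f2, row.f3])))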

4 Answers
  •  星月不相逢
    2020-12-29 00:22

    I'm doing something similar using PySpark. I'd guess you could translate this directly to Scala, since there is nothing Python-specific about it. myPointsWithID is my RDD with an ID for each point, where each point is represented as an array of values.

    from pyspark.mllib.clustering import KMeans

    # Get an RDD of only the vectors representing the points to be clustered.
    points = myPointsWithID.map(lambda idPoint: idPoint[1])

    # Train the model: 100 clusters, random initialization. The runs
    # parameter is honoured by old MLlib versions like 1.3 but was
    # deprecated in later Spark releases.
    clusters = KMeans.train(points,
                            100,
                            maxIterations=100,
                            runs=50,
                            initializationMode='random')

    # For each point in the original RDD, replace the point with the
    # ID of the cluster the point belongs to. Broadcasting the trained
    # model avoids re-serializing it into every task closure.
    clustersBC = sc.broadcast(clusters)
    pointClusters = myPointsWithID.map(
        lambda idPoint: (idPoint[0], clustersBC.value.predict(idPoint[1])))
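
    To get the assignments back into a DataFrame (the "back again" half of the question), you can rebuild one from the (id, cluster) pairs. A sketch, assuming the hiveContext from the question; the temp table name point_clusters is chosen for illustration:

    # Build a DataFrame from the (id, cluster) pairs; passing a list of
    # column names to createDataFrame works from Spark 1.3 onward.
    assignments = hiveContext.createDataFrame(pointClusters, ["id", "cluster"])

    # Register it as a temp table so it can be joined back against the
    # original Hive table or written out.
    assignments.registerTempTable("point_clusters")
    hiveContext.sql("SELECT * FROM point_clusters LIMIT 5").show()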
    
