I aim to apply a k-means clustering algorithm to a very large data set using Spark (1.3.1) MLlib. I have loaded the data from HDFS using a HiveContext from Spark, and would …
I'm doing something similar using PySpark. I'm guessing you could translate this directly to Scala, since there is nothing Python-specific about it. myPointsWithID is my RDD with an ID for each point, where each point is represented as an array of values.
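Since you are pulling the data through a HiveContext, a minimal sketch of building such an (id, features) RDD might look like the following. The table and column names (my_points_table, id, f1..f3) are assumptions for illustration, not from your setup:

from pyspark.sql import HiveContext

# Assumed table and column names; adjust to your actual schema
hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT id, f1, f2, f3 FROM my_points_table")

# Build (id, [feature values]) pairs, mirroring myPointsWithID below
myPointsWithID = rows.rdd.map(lambda row: (row[0], [float(x) for x in row[1:]]))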
from pyspark.mllib.clustering import KMeans

# Get an RDD of only the vectors representing the points to be clustered
points = myPointsWithID.map(lambda id_point: id_point[1])

# Train k-means with k=100 clusters, up to 100 iterations, 50 parallel runs,
# and randomly initialized cluster centers
clusters = KMeans.train(points,
                        100,
                        maxIterations=100,
                        runs=50,
                        initializationMode='random')

# For each point in the original RDD, replace the point with the
# ID of the cluster the point belongs to.
clustersBC = sc.broadcast(clusters)
pointClusters = myPointsWithID.map(lambda id_point: (id_point[0], clustersBC.value.predict(id_point[1])))
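As a quick sanity check on the result (a hedged example, not part of the code above), you can count how many points landed in each cluster; with only 100 clusters the result is small enough to collect to the driver:

# Count points per cluster ID and pull the small result back to the driver
clusterSizes = (pointClusters
                .map(lambda id_cluster: (id_cluster[1], 1))
                .reduceByKey(lambda a, b: a + b)
                .collectAsMap())
print(clusterSizes)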