Inconsistent results with KMeans between Apache Spark and scikit-learn

Submitted by 大憨熊 on 2019-12-05 17:31:45

There is nothing wrong with WSSE failing to decrease monotonically. In theory, WSSE must decrease monotonically with k only if each clustering is optimal, that is, if for every k it is the clustering with the lowest WSSE among all possible choices of k centers.
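This monotonicity claim about *optimal* clusterings can be checked directly on a tiny example. The sketch below (hypothetical data, brute-force search, not anything Spark or scikit-learn does) enumerates every possible assignment of a few 1-D points to k clusters and confirms that the best achievable WSSE never increases as k grows:

```python
# Brute-force check: the *optimal* WSSE is non-increasing in k.
# Toy 1-D data, purely illustrative.
from itertools import product

points = [0.0, 1.0, 2.0, 10.0, 11.0, 20.0]

def wsse(assignment, k):
    """WSSE of one assignment: squared distances to each cluster's mean."""
    total = 0.0
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        if not members:
            continue
        mean = sum(members) / len(members)
        total += sum((p - mean) ** 2 for p in members)
    return total

def best_wsse(k):
    """Minimum WSSE over every possible assignment of points to k clusters."""
    return min(wsse(a, k) for a in product(range(k), repeat=len(points)))

scores = [best_wsse(k) for k in range(1, 4)]
assert scores[0] > scores[1] > scores[2]  # optimal WSSE strictly improves here
```

K-means, of course, does not search all assignments; that gap is exactly where the non-monotonic behavior comes from.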

The problem is that K-means is not necessarily able to find the optimal clustering for a given k. Its iterative process can converge from a random starting point to a local minimum, which may be good but is not the global optimum.
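To make the local-minimum point concrete, here is a minimal sketch of Lloyd's algorithm (the core k-means iteration) run from several random starts on made-up 1-D data; different seeds can settle at fixed points with different final WSSE:

```python
# Minimal Lloyd's algorithm sketch on toy data; different random starting
# centroids can converge to different local minima with different WSSE.
import random

data = [0.0, 0.5, 1.0, 8.0, 8.5, 9.0, 20.0]

def lloyd(points, k, seed, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centroids
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[idx].append(p)
        # update step: move each centroid to its cluster's mean
        new_centers = [sum(cl) / len(cl) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:           # reached a fixed point
            break
        centers = new_centers
    return sum(min((p - c) ** 2 for c in centers) for p in points)

# Different seeds may land in different basins and report different WSSE:
results = {lloyd(data, k=2, seed=s) for s in range(10)}
```

No run can do better than the global optimum, but some runs can do worse, and nothing forces the k=6 run to beat the k=5 run.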

There are variants like K-means++ and K-means|| whose initialization is more likely to choose diverse, well-separated centroids, and which therefore lead more reliably to a good clustering; Spark MLlib, in fact, implements K-means||. However, all of these still have an element of randomness in the selection and cannot guarantee an optimal clustering.
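The seeding idea behind these variants is simple enough to sketch. The following is a rough illustration of k-means++-style initialization (not Spark's actual implementation): each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen, which biases the start toward well-separated centers:

```python
# Sketch of k-means++-style seeding on 1-D points (illustrative only).
import random

def kmeanspp_init(points, k, seed=0):
    rng = random.Random(seed)
    centers = [rng.choice(points)]           # first center: uniform at random
    while len(centers) < k:
        # squared distance from each point to its nearest chosen center
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        # next center sampled with probability proportional to that distance
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers
```

K-means|| is a parallel-friendly variant of the same idea that oversamples candidates in a few passes rather than one point at a time, but the sampling is still random.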

The random starting set of centroids chosen for k=6 perhaps led to a particularly suboptimal clustering, or the algorithm may have stopped early, before it reached its local optimum.

You can improve the result by tuning KMeans's parameters manually. The algorithm exposes a threshold via tol that controls the minimum amount of centroid movement considered significant; lower values mean the algorithm will let the centroids keep moving for longer.

Increasing the maximum number of iterations with maxIter also prevents it from potentially stopping too early at the cost of possibly more computation.

So my advice is to re-run your clustering with

 ...
 # increase from the default of 20
 max_iter = 40
 # decrease from the default of 0.0001
 tol = 0.00001
 km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_col,
             k=i, maxIter=max_iter, seed=seed, tol=tol)
 ...