Is sklearn.cluster.KMeans sensitive to data point order?

若如初见. Submitted on 2019-12-24 19:30:03

Question


As noted in the answer to this post about feature scaling, some (all?) implementations of KMeans are sensitive to the order of the data points. Based on the sklearn.cluster.KMeans documentation, n_init only changes the initial positions of the centroids. This would mean that one must loop over a few shuffles of the data points to test whether this is a problem. My questions are as follows:

  1. Is the scikit-learn implementation sensitive to the ordering, as the post suggests?
  2. Does n_init take care of it for me?
  3. If I am to do it myself, should I take the best result based on minimum inertia or take an average, as suggested here?
  4. Is there a good rule to know how many shuffle permutations is sufficient based on the number of data points?

UPDATE: The question initially asked about feature (column) order, which is not an issue. That was a misinterpretation of the term "objects" in the linked post. The question has been updated to ask about the order of the data points (rows).
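One way to probe the question empirically (a sketch, not from the original post; the toy dataset and parameters are illustrative assumptions) is to fit KMeans on the same data in two different row orders and compare the inertia:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 4 well-separated Gaussian blobs.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Shuffle the rows (data points) into a different order.
rng = np.random.default_rng(0)
X_shuffled = X[rng.permutation(len(X))]

# With enough restarts (n_init), both fits should converge to the same
# solution, so the inertia should agree up to tiny numeric differences.
km_orig = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
km_shuf = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_shuffled)

print(km_orig.inertia_, km_shuf.inertia_)
```

If the two inertia values diverge noticeably, the runs landed in different local optima, which points at initialization rather than row order as the cause.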


Answer 1:


K-means is not sensitive to feature order.

The post you refer to talks about scale, not order.

If you look at the k-means equations, it should be obvious that the order does not matter: the objective is a sum over all points, and a sum is unchanged by permuting its terms.

There has been research (von Luxburg, if I recall correctly) that essentially says that if there is a good k-means result, then it must be easy to find. If you get very different results when running k-means multiple times, then none of the results is good.
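That stability argument can be checked directly: run k-means several times with different seeds and compare the labelings, for instance with the adjusted Rand index (a sketch; the toy data is an assumption, not from the post):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Cluster the same data with several different random seeds.
runs = [KMeans(n_clusters=4, n_init=5, random_state=s).fit_predict(X)
        for s in range(5)]

# Pairwise agreement with the first run; ARI = 1.0 means identical
# partitions (up to label renaming). Low scores would signal instability.
scores = [adjusted_rand_score(runs[0], r) for r in runs[1:]]
print([round(s, 3) for s in scores])
```

On data with genuine cluster structure the scores stay near 1.0; consistently low scores across seeds suggest the clustering itself is unreliable.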

There are "n choose k" possible initializations. While they can't all be bad, n_init will only try a very few of them, so there is no guarantee of finding the "best" one. The function returns the run with the lowest SSQ, but that does not mean it is the most useful result in the end, unless you only care about SSQ.
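What n_init does internally can be sketched by hand: run several single-initialization fits and keep the one with the lowest inertia (SSQ). The loop below is illustrative, not sklearn's actual implementation:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# One k-means run per seed, a single initialization each (n_init=1),
# keeping the run with the lowest within-cluster sum of squares.
best = min(
    (KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
     for seed in range(20)),
    key=lambda km: km.inertia_,
)
print(round(best.inertia_, 2))
```

Passing n_init=20 to a single KMeans call does the same selection in one shot; the manual loop just makes the "lowest SSQ wins" rule explicit.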



Source: https://stackoverflow.com/questions/47604826/is-sklearn-cluster-kmeans-sensative-to-data-point-order
