Is sklearn.cluster.KMeans sensitive to data point order?

若如初见. Submitted on 2019-12-24 19:30:03

Question


As noted in the answer to this post about feature scaling, some (all?) implementations of KMeans are sensitive to the order of the data points. Based on the sklearn.cluster.KMeans documentation, n_init only changes the initial positions of the centroids. This would mean that one must loop over a few shuffles of the data points to test whether this is a problem. My questions are as follows:

  1. Is the scikit-learn implementation sensitive to the ordering, as the post suggests?
  2. Does n_init take care of it for me?
  3. If I am to do it myself, should I take the best result based on minimum inertia or take an average, as suggested here?
  4. Is there a good rule to know how many shuffle permutations is sufficient based on the number of data points?

UPDATE: The question initially asked about feature (column) order, which is not an issue. That was a misinterpretation of the term "objects" in the linked post. The question has been updated to ask about the order of the data points (rows).
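One way to probe the question empirically (a sketch, not from the original post; the toy dataset and parameters are illustrative assumptions) is to fit KMeans on the same data in two different row orders and compare the inertia:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 4 well-separated Gaussian blobs.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Shuffle the rows (data points) into a different order.
rng = np.random.default_rng(0)
X_shuffled = X[rng.permutation(len(X))]

# With enough restarts (n_init), both fits should converge to the same
# solution, so the inertia should agree up to tiny numeric differences.
km_orig = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
km_shuf = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_shuffled)

print(km_orig.inertia_, km_shuf.inertia_)
```

If the two inertia values diverge noticeably, the runs landed in different local optima, which points at initialization rather than row order as the cause.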


Answer 1:


K-means is not sensitive to feature order.

The post you refer to talks about scale, not order.

If you look at the k-means equations, it should be obvious that the order does not matter: the objective is a sum over all points, and a sum is unchanged by permuting its terms.

There has been research (von Luxburg, if I recall correctly) that essentially says that if there is a good k-means result, then it must be easy to find. If you get very different results when running k-means multiple times, then none of the results is good.
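That stability argument can be checked directly: run k-means several times with different seeds and compare the labelings, for instance with the adjusted Rand index (a sketch; the toy data is an assumption, not from the post):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Cluster the same data with several different random seeds.
runs = [KMeans(n_clusters=4, n_init=5, random_state=s).fit_predict(X)
        for s in range(5)]

# Pairwise agreement with the first run; ARI = 1.0 means identical
# partitions (up to label renaming). Low scores would signal instability.
scores = [adjusted_rand_score(runs[0], r) for r in runs[1:]]
print([round(s, 3) for s in scores])
```

On data with genuine cluster structure the scores stay near 1.0; consistently low scores across seeds suggest the clustering itself is unreliable.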

There are "n choose k" possible initializations. While they can't all be bad, n_init will only try a very few of them, so there is no guarantee of finding the "best" one. The function returns the run with the lowest SSQ, but that does not mean it is the most useful result in the end, unless you only care about SSQ.
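What n_init does internally can be sketched by hand: run several single-initialization fits and keep the one with the lowest inertia (SSQ). The loop below is illustrative, not sklearn's actual implementation:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# One k-means run per seed, a single initialization each (n_init=1),
# keeping the run with the lowest within-cluster sum of squares.
best = min(
    (KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
     for seed in range(20)),
    key=lambda km: km.inertia_,
)
print(round(best.inertia_, 2))
```

Passing n_init=20 to a single KMeans call does the same selection in one shot; the manual loop just makes the "lowest SSQ wins" rule explicit.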



Source: https://stackoverflow.com/questions/47604826/is-sklearn-cluster-kmeans-sensative-to-data-point-order
