问题
I am using the MiniBatchKMeans model from the sklearn.cluster module in anaconda. I am clustering a data-set that contains approximately 75,000 points. It looks something like this:
data = np.array([8,3,1,17,5,21,1,7,1,26,323,16,2334,4,2,67,30,2936,2,16,12,28,1,4,190...])
I fit the data using the process below.
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(batch_size=100)
kmeans.fit(data.reshape(-1,1)
This is all well and okay, and I proceed to find the centroids of the data:
centroids = kmeans.cluster_centers_
print centroids
Which gives me the following output:
array([[   13.09716569],
       [ 2908.30379747],
       [   46.05089228],
       [  725.83453237],
       [   95.39868475],
       [ 1508.38356164],
       [  175.48099948],
       [  350.76287263]])
But, when I run the process again, using the same data, I get different values for the centroids, such as this:
array([[   29.63143489],
       [ 1766.7244898 ],
       [  171.04417206],
       [ 2873.70454545],
       [   70.05295277],
       [ 1074.50387597],
       [  501.36134454],
       [    8.30600975]])
Can anyone explain why this is?
回答1:
The behavior you are experiencing probably has to do with the under the hood implementation of k-means clustering that you are using. k-means clustering is an NP-hard problem, so all the implementations out there are heuristic methods. What this means practically is that for a given seed, it will converge toward a local optima that isn't necessarily consistent across multiple seeds.
回答2:
Read up on what mini-batch k-means is.
It will never even converge. Do one more iteration, the result will change again.
It is design for data sets so huge you cannot load them into memory at once. So you load a batch, pretend this were the full data set, do one iterarion. Repeat woth the next batch. If your batches are large enough and random, then the result will be "close enough" to be usable. While itjs never optimal.
Thus:
- the minibatch results are even more random than regular k-means results. They change every iteration.
- if you can load your data into memory, don't use minibatch. Instead use a fast k-means implementation. (most are surprisingly slow).
P.S. on one-dimensional data, sort your data set and then use an algorithm that benefits from the sorting instead of k-means.
来源:https://stackoverflow.com/questions/38751364/minibatchkmeans-gives-different-centroids-after-subsequent-iterations