MiniBatchKMeans gives different centroids after subsequent iterations


Question


I am using the MiniBatchKMeans model from the sklearn.cluster module in Anaconda. I am clustering a dataset that contains approximately 75,000 points. It looks something like this:

import numpy as np

data = np.array([8,3,1,17,5,21,1,7,1,26,323,16,2334,4,2,67,30,2936,2,16,12,28,1,4,190...])

I fit the data using the process below.

from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(batch_size=100)
kmeans.fit(data.reshape(-1, 1))

This all works fine, and I proceed to find the centroids of the data:

centroids = kmeans.cluster_centers_
print(centroids)

Which gives me the following output:

array([[  13.09716569],
       [2908.30379747],
       [  46.05089228],
       [ 725.83453237],
       [  95.39868475],
       [1508.38356164],
       [ 175.48099948],
       [ 350.76287263]])

But, when I run the process again, using the same data, I get different values for the centroids, such as this:

array([[  29.63143489],
       [1766.7244898 ],
       [ 171.04417206],
       [2873.70454545],
       [  70.05295277],
       [1074.50387597],
       [ 501.36134454],
       [   8.30600975]])

Can anyone explain why this is?


Answer 1:


The behavior you are seeing probably comes from the under-the-hood implementation of the k-means clustering you are using. k-means clustering is an NP-hard problem, so all the available implementations are heuristics. In practice this means that, for a given seed, the algorithm converges toward a local optimum, and that optimum is not necessarily the same across different seeds.
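As an illustration, here is a minimal sketch (with made-up data, not the asker's) showing that pinning random_state makes MiniBatchKMeans reproducible across runs, whereas omitting it leaves each run free to start from a different seed and land in a different local optimum:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in for the asker's ~75,000 one-dimensional points.
data = np.random.default_rng(0).exponential(scale=300, size=75_000)

# With a fixed random_state the centroids come out identical on every run;
# drop the argument and they will differ from run to run.
kmeans = MiniBatchKMeans(n_clusters=8, batch_size=100, random_state=42)
kmeans.fit(data.reshape(-1, 1))
print(kmeans.cluster_centers_)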




Answer 2:


Read up on what mini-batch k-means is.

It will never even converge. Do one more iteration and the result will change again.

It is designed for data sets so huge that you cannot load them into memory at once. You load a batch, pretend it is the full data set, and do one iteration. Then you repeat with the next batch. If your batches are large enough and random, the result will be "close enough" to be usable, though it is never optimal.
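A minimal sketch of this batch-at-a-time behavior, using sklearn's partial_fit with synthetic batches standing in for chunks streamed from disk (the data and loop count are made up for illustration):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
mbk = MiniBatchKMeans(n_clusters=8, random_state=42)

# Feed the model one chunk at a time, as if each chunk were read from disk.
for _ in range(200):
    batch = rng.exponential(scale=300, size=100).reshape(-1, 1)  # hypothetical chunk
    mbk.partial_fit(batch)

print(mbk.cluster_centers_)  # every extra partial_fit call nudges these again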

Thus:

  1. the mini-batch results are even more random than regular k-means results; they change on every iteration.
  2. if you can load your data into memory, don't use mini-batch. Use a fast regular k-means implementation instead (most implementations are surprisingly slow); see the sketch after this list.
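For the in-memory case, a sketch with the same hypothetical data: full-batch KMeans sees the entire array on every update, so with a fixed seed the centroids are stable from run to run:

import numpy as np
from sklearn.cluster import KMeans

data = np.random.default_rng(0).exponential(scale=300, size=75_000)  # hypothetical data

# Full-batch k-means with a fixed random_state gives the same result every run.
km = KMeans(n_clusters=8, n_init=10, random_state=0)
km.fit(data.reshape(-1, 1))
print(np.sort(km.cluster_centers_, axis=0))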

P.S.: on one-dimensional data, sort your data set and then use an algorithm that exploits the sorted order instead of k-means.
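For example, in one dimension the clusters of an optimal k-means solution are always contiguous runs of the sorted values, so a dynamic program over split points finds the exact optimum. A minimal sketch of that idea (my own illustration, O(k·n²) in pure Python, so suited to modest n; libraries in the spirit of Ckmeans.1d.dp do the same thing faster):

import numpy as np

def kmeans_1d(x, k):
    # Optimal 1-D k-means by dynamic programming over the sorted values.
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # Prefix sums give the within-segment squared error of x[i:j] in O(1).
    p1 = np.concatenate(([0.0], np.cumsum(x)))
    p2 = np.concatenate(([0.0], np.cumsum(x ** 2)))

    def seg_cost(i, j):
        s, s2, m = p1[j] - p1[i], p2[j] - p2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    dp = np.full((k + 1, n + 1), INF)
    cut = np.zeros((k + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = dp[m - 1, i] + seg_cost(i, j)
                if c < dp[m, j]:
                    dp[m, j], cut[m, j] = c, i

    # Walk the cut points back to recover the segments and their means.
    bounds, j = [], n
    for m in range(k, 0, -1):
        i = cut[m, j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return [x[i:j].mean() for i, j in bounds]

print(kmeans_1d(np.random.default_rng(0).exponential(scale=300, size=500), 8))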



Source: https://stackoverflow.com/questions/38751364/minibatchkmeans-gives-different-centroids-after-subsequent-iterations
