Should we used k-means++ instead of k-means?

不打扰是莪最后的温柔 提交于 2019-12-03 07:07:51

Nobody claims k-means++ runs in O(lg k) time; it's solution quality is O(lg k)-competitive with the optimal solution. Both k-means++ and the common method, called Lloyd's algorithm, are approximations to an NP-hard optimization problem.

I'm not sure what the worst case running time of k-means++ is; note that in Arthur & Vassilvitskii's original description, steps 2-4 of the algorithm refer to Lloyd's algorithm. They do claim that it works both better and faster in practice because it starts from a better position.

The drawbacks of k-means++ are thus:

  1. It too can find a suboptimal solution (it's still an approximation).
  2. It's not consistently faster than Lloyd's algorithm (see Arthur & Vassilvitskii's tables).
  3. It's more complicated than Lloyd's algo.
  4. It's relatively new, while Lloyd's has proven it's worth for over 50 years.
  5. Better algorithms may exist for specific metric spaces.

That said, if your k-means library supports k-means++, then by all means try it out.

denis

Not your question, but an easy speedup to any kmeans method for large N:

1) first do k-means on a random sample of say sqrt(N) of the points
2) then run full k-means from those centres.

I've found this 5-10 times faster than kmeans++ for N 10000, k 20, with similar results.
How well it works for you will depend on how well a sqrt(N) sample approximates the whole, as well as on N, dim, k, ninit, delta ...

What are your N (number of data points), dim (number of features), and k ?
The huge range in users' N, dim, k, data noise, metrics ... not to mention the lack of public benchmarks, make it tough to compare methods.

Added: Python code for kmeans() and kmeanssample() is here on SO; comments are welcome.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!