Why is Spark Mllib KMeans algorithm extremely slow?

zero323

It looks like the reason is relatively simple: you use a quite large k and combine it with an expensive initialization algorithm.

By default Spark uses a distributed variant of K-means++ called K-means|| (see What exactly is the initializationSteps parameter in Kmeans++ in Spark MLLib?). The distributed initialization is roughly O(k), so with a larger k you can expect a slower start. This explains why you see no improvement when you reduce the number of iterations: the initialization cost is paid before the first iteration even runs.
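For reference, a minimal sketch of where these knobs live in the `pyspark.ml` API (the `df` DataFrame and the chosen parameter values are assumptions, only for illustration):

```python
from pyspark.ml.clustering import KMeans

# k-means|| is the default initMode; initSteps (default 2) controls how many
# initialization rounds are run, and each round is roughly O(k).
kmeans = KMeans(k=500, initMode="k-means||", initSteps=2, maxIter=20)
# model = kmeans.fit(df)  # df: a DataFrame with a "features" vector column
```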

Using a large k is also expensive when the model is trained. Spark uses a variant of Lloyd's algorithm, which is roughly O(n·k·d·i) for n points of dimension d, k clusters, and i iterations.
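A back-of-the-envelope calculation shows how quickly this grows with k (all numbers below are hypothetical, purely to illustrate the scaling):

```python
# Rough cost of Lloyd's algorithm, O(n*k*d*i): the dominant term is
# computing n*k distances of dimension d per iteration.
n = 1_000_000   # points
d = 100         # dimensions
i = 20          # iterations
for k in (10, 1000):
    ops = n * k * d * i
    print(f"k={k:>5}: ~{ops:.1e} distance-related operations")
```

Going from k=10 to k=1000 multiplies the work by 100, on top of the more expensive initialization.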

If you expect complex structure in the data, there are most likely better algorithms out there to handle it than K-means, but if you really want to stick with K-means, start by using random initialization, as in the sketch below.
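A minimal runnable sketch using the `pyspark.ml` API; the tiny synthetic dataset is only for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-random-init").getOrCreate()

# Toy data: two obvious clusters in 2D.
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"],
)

# Random initialization skips the O(k) k-means|| seeding entirely.
kmeans = KMeans(k=2, initMode="random", maxIter=20, seed=1)
model = kmeans.fit(df)
print(model.clusterCenters())
```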

Please also try other implementations of k-means. Some, like the variants in ELKI, are way better than Spark, even on a single CPU. You may be surprised how much performance you can get out of a single node without going to a cluster. From my experiments, you would need at least a 100-node cluster to beat a good local implementation, unfortunately.

I have read that these C++ versions are multi-core (but single-node) and probably the fastest K-means you can find right now, but I have not tried them myself yet (for all my needs, the ELKI versions were blazingly fast, finishing in a few seconds on my largest data sets).
