Is Spark's KMeans unable to handle big data?

Submitted by 一笑奈何 on 2019-11-27 02:14:29

I think the 'hanging' is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in both PySpark and Scala. However, it takes a lot longer than it should, and almost all of that time is spent in k-means|| initialization.

I opened https://issues.apache.org/jira/browse/SPARK-17389 to track two main improvements, one of which you can use now. Edit: really, see also https://issues.apache.org/jira/browse/SPARK-11560

First, there are some code optimizations that would speed up the init by about 13%.

However, most of the issue is that it defaults to 5 steps of k-means|| init, when it seems that 2 is almost always just as good. You can set the number of initialization steps to 2 to see a speedup, especially in the stage that's hanging now.
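As a rough sketch of what setting the init steps could look like in PySpark, assuming the DataFrame-based `pyspark.ml.clustering.KMeans` estimator and a vector column named `features` (the helper names, the column name, and the seed are illustrative assumptions, not taken from the original code):

```python
def kmeans_init_params(k, init_steps=2, seed=42):
    """Build keyword arguments for pyspark.ml.clustering.KMeans.

    The helper name and the default seed are illustrative assumptions.
    """
    return {
        "k": k,
        "initMode": "k-means||",  # Spark's parallel variant of k-means++
        "initSteps": init_steps,  # number of k-means|| rounds (2 instead of 5)
        "seed": seed,
    }


def fit_kmeans(df, k, features_col="features"):
    """Fit KMeans on a DataFrame with a vector column named `features_col`."""
    # Imported lazily so kmeans_init_params can be used without a Spark install.
    from pyspark.ml.clustering import KMeans

    estimator = KMeans(featuresCol=features_col, **kmeans_init_params(k))
    return estimator.fit(df)
```

With the RDD-based API, the corresponding knob is the `initializationSteps` argument of `pyspark.mllib.clustering.KMeans.train`. This only matters if your Spark version still defaults to 5 steps.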

In my (smaller) test on my laptop, init time went from 5:54 to 1:41 with both changes, mostly due to setting init steps.

If your RDD is very large, collectAsMap will attempt to copy every single element in the RDD onto the single driver program, which then runs out of memory and crashes. Even though you have partitioned the data, collectAsMap sends everything to the driver and your job crashes. You can make sure the number of elements you return is capped by calling take or takeSample, or perhaps by filtering or sampling your RDD. Similarly, be cautious with these other actions unless you are sure your dataset is small enough to fit in memory:

countByKey, countByValue, collect
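A minimal sketch of capping what comes back to the driver, assuming `rdd` is any PySpark RDD (the helper names and default sizes are hypothetical; only `take` and `takeSample` are actual RDD methods):

```python
def bounded_preview(rdd, n=100):
    """Return at most n elements to the driver instead of the whole RDD."""
    return rdd.take(n)


def bounded_sample(rdd, n=100, seed=42):
    """Return a random sample of at most n elements, drawn without replacement."""
    # takeSample(withReplacement, num, seed) still ships only `num` elements
    # back to the driver, unlike collect() or collectAsMap().
    return rdd.takeSample(False, n, seed)
```

With a live SparkContext `sc` (assumed here), `bounded_preview(sc.parallelize(range(10**6)), 5)` pulls back five elements rather than the full million.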

If you really do need every value in the RDD and the data is too big to fit into memory, you could write the RDD out to files or export it to a database that is large enough to hold all the data. Since you are using an API, I don't think you can do that here (rewrite all the code, maybe? Increase memory?). I think this collectAsMap in the runAlgorithm method is a really bad thing in KMeans (https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html)...
