Why does the groupByKey operation always have 200 tasks?

Submitted by 懵懂的女人 on 2019-12-10 17:16:35

Question


Whenever I do a groupByKey on an RDD, it gets split into 200 tasks, even when the original table is quite large, e.g. 2,000 partitions and tens of millions of rows.

Moreover, the operation seems to get stuck on the last two tasks, which take extremely long to compute.

Why is it 200? How can I increase it, and will that help?


Answer 1:


This value comes from spark.sql.shuffle.partitions, which is the number of partitions to use when grouping. It defaults to 200 but can be increased. That may help, though it depends on the cluster and the data.
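
A minimal sketch of the two ways to raise the partition count, assuming a Scala Spark application (the app name, value 1000, and sample data are illustrative, not from the original question):

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsExample {
  def main(args: Array[String]): Unit = {
    // Raise the shuffle partition count for SQL/DataFrame shuffles (default is 200).
    val spark = SparkSession.builder()
      .appName("ShufflePartitionsExample")
      .config("spark.sql.shuffle.partitions", "1000")
      .getOrCreate()

    val sc = spark.sparkContext
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // For RDD operations, the partition count can also be passed explicitly per call.
    val grouped = pairs.groupByKey(numPartitions = 1000)
    println(grouped.getNumPartitions) // 1000

    spark.stop()
  }
}
```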

The last two tasks taking very long is most likely due to skewed data: those keys contain many more values than the rest. Could you use reduceByKey / combineByKey rather than groupByKey, or parallelize the problem differently?
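
As a hedged illustration of that suggestion (a toy per-key count, not the asker's actual workload): groupByKey ships every value for a key to one reducer, while reduceByKey pre-aggregates within each partition before the shuffle, so heavily skewed keys move far less data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsGroup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ReduceVsGroup").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1), ("a", 1)))

    // groupByKey: all values for a key are collected on one reducer, then summed.
    val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey: values are combined map-side first, shrinking the shuffle.
    val countsViaReduce = pairs.reduceByKey(_ + _)

    countsViaReduce.collect().foreach(println) // (a,3), (b,1)
    sc.stop()
  }
}
```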



Source: https://stackoverflow.com/questions/31265927/why-does-groupbykey-operation-have-always-200-tasks
