What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?

庸人自扰 2020-12-02 10:02

Hi, I am using Spark SQL (actually hiveContext.sql()), which runs group by queries, and I am running into OOM issues. So I am thinking of increasing the value of spark.sql.shuffle.partitions.
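
For reference, a minimal sketch of raising that setting before running the aggregation (assuming Spark 1.x with a HiveContext; the app name, the "events" table, the query, and the value 1000 are illustrative placeholders, not from the original question):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object ShufflePartitionsExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-example"))
        val hiveContext = new HiveContext(sc)

        // Raise the number of partitions used for shuffles (group by, joins).
        // The default is 200; 1000 is only an illustrative value.
        hiveContext.setConf("spark.sql.shuffle.partitions", "1000")

        // Hypothetical aggregation; "events" is a placeholder table.
        val result = hiveContext.sql(
          "SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id")
        result.show()
      }
    }

The setting can also be passed at submit time with --conf spark.sql.shuffle.partitions=1000, which avoids hard-coding it in the job.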

4 Answers
  •  心在旅途
    2020-12-02 10:25

    OK, so I think your issue is more general. It's not specific to Spark SQL; it's a general problem with Spark where it ignores the number of partitions you tell it to use when there are only a few input files. Spark seems to use the same number of partitions as there are files on HDFS, unless you call repartition. So calling repartition ought to work, but it has the caveat of causing a shuffle somewhat unnecessarily.
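
    As an illustration of that workaround (a sketch assuming an existing HiveContext named hiveContext; the table name and partition counts are placeholders, not from the original post):

        // Hypothetical: "events" and the numbers 1000/200 are placeholders.
        val df = hiveContext.table("events")     // read parallelism follows the number of input files
        println(df.rdd.partitions.length)        // often small when there are only a few files

        val wide = df.repartition(1000)          // forces 1000 partitions, at the cost of a full shuffle
        val narrow = wide.coalesce(200)          // coalesce can only shrink the partition count, but avoids a full shuffle

    Note that coalesce only helps when reducing partitions; to get more partitions than there are input files, the shuffle from repartition (or a higher spark.sql.shuffle.partitions taking effect after the first shuffle stage) is hard to avoid.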

    I raised this question a while ago and still haven't received a good answer :(

    Spark: increase number of partitions without causing a shuffle?
