How to optimize shuffle spill in an Apache Spark application

没有蜡笔的小新 2020-12-04 07:38

I am running a Spark streaming application with 2 workers. The application has a join and a union operation.

All the batches are completing successfully, but I noticed that the shuffle spill is very high.

2 Answers
  •  孤街浪徒
    2020-12-04 08:08

    To add to the above answer, you may also consider increasing spark.sql.shuffle.partitions (the default number of partitions created when a shuffle occurs) from 200 to a value that yields partitions close to the HDFS block size (i.e. 128 MB to 256 MB), as in the sketch below.
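
    A minimal sketch of that change (not from the original answer; the value 400 is only an assumption, derive yours from total shuffle bytes divided by a 128-256 MB target partition size):

        // A minimal sketch, not the answer author's code. The value 400 is an
        // assumption; compute your own target from shuffle-write volume.
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("shuffle-partitions-example")
          .config("spark.sql.shuffle.partitions", "400")
          .getOrCreate()

        // spark.sql.shuffle.partitions is a runtime SQL conf, so it can also
        // be adjusted per job:
        spark.conf.set("spark.sql.shuffle.partitions", "400")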

    If your data is skewed, try tricks like salting the keys to increase parallelism; see the sketch after this paragraph.
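
    For illustration, a hedged sketch of key salting on a join; the table and column names (events, dims, key) are hypothetical. The large, skewed side gets a random salt in [0, N), and the small side is replicated N times so every salted key still finds its match:

        // Hedged sketch of key salting; all names here are made up for
        // illustration, not taken from the question.
        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions._

        val spark = SparkSession.builder().appName("salting-example").getOrCreate()
        import spark.implicits._

        val N = 10 // salt fan-out; tune to the observed skew

        // Hypothetical inputs: a large skewed table and a small dimension table.
        val events = Seq(("hot", 1), ("hot", 2), ("cold", 3)).toDF("key", "value")
        val dims   = Seq(("hot", "a"), ("cold", "b")).toDF("key", "attr")

        // Skewed side: append a random salt so one hot key spreads over N partitions.
        val saltedEvents = events.withColumn("salt", (rand() * N).cast("int"))

        // Small side: replicate each row once per salt value so the join still matches.
        val saltedDims = dims.withColumn("salt", explode(array((0 until N).map(lit): _*)))

        // Join on (key, salt) instead of key alone.
        val joined = saltedEvents.join(saltedDims, Seq("key", "salt"))
        joined.show()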

    Read these to understand Spark memory management:

    https://0x0fff.com/spark-memory-management/

    https://www.tutorialdocs.com/article/spark-memory-management.html
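
    As a quick reference, the unified-memory settings those articles discuss can be set when the session is built; the values below are simply the Spark defaults, not tuning advice:

        // Illustrative only: the unified-memory knobs covered by the linked
        // articles, shown at their default values.
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("memory-config-example")
          // Fraction of (heap - 300 MB) shared by execution and storage; default 0.6.
          .config("spark.memory.fraction", "0.6")
          // Share of that region shielded from eviction for cached data; default 0.5.
          .config("spark.memory.storageFraction", "0.5")
          .getOrCreate()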
