I am running a Spark Streaming application with 2 workers. The application has a join and a union operation.
All the batches are completing successfully, but I noticed a large amount of shuffle spill in the metrics.
To add to the above answer, you may also consider increasing the default number of shuffle partitions (spark.sql.shuffle.partitions) from 200 to a number that results in partitions close to the HDFS block size (e.g. 128 MB to 256 MB).
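As a minimal sketch of where this setting lives (the app name and partition count are placeholders, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: pick a partition count that yields shuffle partitions
// near the HDFS block size (128-256 MB) for your data volume.
val spark = SparkSession.builder()
  .appName("shuffle-partition-tuning")            // hypothetical app name
  .config("spark.sql.shuffle.partitions", "400")  // placeholder value
  .getOrCreate()
```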
If your data is skewed, try tricks like salting the keys to increase parallelism.
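A minimal salting sketch, assuming a DataFrame df with key and value columns (the column names, bucket count, and aggregation are illustrative):

```scala
import org.apache.spark.sql.functions._

// Spread hot keys across `buckets` salted sub-keys, aggregate partially,
// then combine the partials back per original key.
val buckets = 16  // illustrative; size this to the observed skew
val partials = df
  .withColumn("salt", (rand() * buckets).cast("int"))
  .groupBy(col("key"), col("salt"))
  .agg(sum(col("value")).as("partial_sum"))
val totals = partials
  .groupBy(col("key"))
  .agg(sum(col("partial_sum")).as("total"))
```

The first aggregation runs with up to `buckets` times more parallelism on the hot keys; the second aggregation only combines the small partial results.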
Read these to understand Spark memory management:
https://0x0fff.com/spark-memory-management/
https://www.tutorialdocs.com/article/spark-memory-management.html
Learning to performance-tune Spark requires quite a bit of investigation and learning. There are a few good resources, including this video. Spark 1.4 has better diagnostics and visualisation in the web interface, which can help you.
In summary, you spill when the size of the RDD partitions at the end of the stage exceeds the amount of memory available for the shuffle buffer.
You can:

1. repartition() your prior stage so that you have smaller partitions from input.
2. Increase the shuffle buffer by increasing the memory in your executor processes (spark.executor.memory).
3. Increase the shuffle buffer by increasing the fraction of executor memory allocated to it (spark.shuffle.memoryFraction) from the default of 0.2. You need to give back spark.storage.memoryFraction.
4. Increase the shuffle buffer per thread by reducing the ratio of worker threads (SPARK_WORKER_CORES) to executor memory (see the configuration sketch after this list).

If there is an expert listening, I would love to know more about how the memoryFraction settings interact and their reasonable range.
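Assuming the legacy (pre-1.6) memory model that these memoryFraction settings belong to, here is a minimal sketch of where options 2 and 3 are set; the values and names are placeholders, not recommendations. SPARK_WORKER_CORES from option 4 is an environment variable set in conf/spark-env.sh on the workers, not a SparkConf key.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: placeholder values showing where the knobs live.
val conf = new SparkConf()
  .setAppName("streaming-join-union")          // hypothetical app name
  .set("spark.executor.memory", "4g")          // option 2: larger executor heap
  .set("spark.shuffle.memoryFraction", "0.4")  // option 3: more heap for shuffle...
  .set("spark.storage.memoryFraction", "0.4")  // ...given back from the storage fraction

val ssc = new StreamingContext(conf, Seconds(10))  // batch interval is illustrative
// Option 1 would be e.g. dstream.repartition(...) on the input DStream
// before the join/union, so each task handles a smaller partition.
```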