How to optimize shuffle spill in an Apache Spark application

没有蜡笔的小新 2020-12-04 07:38

I am running a Spark streaming application with 2 workers. The application has a join and a union operation.

All the batches are completing successfully, but I noticed that the shuffle spill is very high.

2 Answers
  •  孤街浪徒
    2020-12-04 08:08

    To add to the above answer, you may also consider increasing spark.sql.shuffle.partitions (the default number of partitions created when a shuffle occurs) from 200 to a value that yields partitions close to the HDFS block size (i.e. 128 MB to 256 MB), as in the sketch below.
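
    A minimal sketch of that change (not from the original answer; the value 400 is only an assumption, derive yours from total shuffle bytes divided by a 128-256 MB target partition size):

        // A minimal sketch, not the answer author's code. The value 400 is an
        // assumption; compute your own target from shuffle-write volume.
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("shuffle-partitions-example")
          .config("spark.sql.shuffle.partitions", "400")
          .getOrCreate()

        // spark.sql.shuffle.partitions is a runtime SQL conf, so it can also
        // be adjusted per job:
        spark.conf.set("spark.sql.shuffle.partitions", "400")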

    If your data is skewed, try tricks like salting the keys to increase parallelism; see the sketch after this paragraph.
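
    For illustration, a hedged sketch of key salting on a join; the table and column names (events, dims, key) are hypothetical. The large, skewed side gets a random salt in [0, N), and the small side is replicated N times so every salted key still finds its match:

        // Hedged sketch of key salting; all names here are made up for
        // illustration, not taken from the question.
        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions._

        val spark = SparkSession.builder().appName("salting-example").getOrCreate()
        import spark.implicits._

        val N = 10 // salt fan-out; tune to the observed skew

        // Hypothetical inputs: a large skewed table and a small dimension table.
        val events = Seq(("hot", 1), ("hot", 2), ("cold", 3)).toDF("key", "value")
        val dims   = Seq(("hot", "a"), ("cold", "b")).toDF("key", "attr")

        // Skewed side: append a random salt so one hot key spreads over N partitions.
        val saltedEvents = events.withColumn("salt", (rand() * N).cast("int"))

        // Small side: replicate each row once per salt value so the join still matches.
        val saltedDims = dims.withColumn("salt", explode(array((0 until N).map(lit): _*)))

        // Join on (key, salt) instead of key alone.
        val joined = saltedEvents.join(saltedDims, Seq("key", "salt"))
        joined.show()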

    Read these to understand Spark memory management:

    https://0x0fff.com/spark-memory-management/

    https://www.tutorialdocs.com/article/spark-memory-management.html
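
    As a quick reference, the unified-memory settings those articles discuss can be set when the session is built; the values below are simply the Spark defaults, not tuning advice:

        // Illustrative only: the unified-memory knobs covered by the linked
        // articles, shown at their default values.
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("memory-config-example")
          // Fraction of (heap - 300 MB) shared by execution and storage; default 0.6.
          .config("spark.memory.fraction", "0.6")
          // Share of that region shielded from eviction for cached data; default 0.5.
          .config("spark.memory.storageFraction", "0.5")
          .getOrCreate()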
