How to optimize shuffle spill in an Apache Spark application
Question: I am running a Spark Streaming application with 2 workers. The application performs a join and a union operation. All batches complete successfully, but I noticed that the shuffle spill metrics are not consistent with the input or output data size: the spill memory is more than 20 times the input size. Please find the Spark stage details in the image below.

After researching this, I found that shuffle spill happens when there is not enough execution memory to hold the shuffle data. Shuffle spill (memory) is the size of the deserialized form of the data in memory at the time it is spilled.
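Since heavy spilling usually slows a stage down, a common first step is to give shuffle operations more execution memory and to spread the shuffled data across more partitions. The sketch below is illustrative rather than a definitive fix: the configuration keys are standard Spark settings, but the concrete values (memory size, parallelism) are placeholder assumptions that would need tuning against the real workload.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative values only -- tune them for the actual cluster and data volume.
val conf = new SparkConf()
  .setAppName("ShuffleSpillTuning")
  // Larger executors leave more room for shuffle buffers before spilling.
  .set("spark.executor.memory", "4g")
  // Fraction of heap shared by execution and storage (unified memory, Spark 1.6+).
  .set("spark.memory.fraction", "0.6")
  // More partitions => smaller per-task shuffle blocks => less spill per task.
  .set("spark.default.parallelism", "200")

val ssc = new StreamingContext(conf, Seconds(10))

// For the join itself, an explicit partition count has the same effect:
// rdd1.join(rdd2, numPartitions = 200)
```

Raising spark.default.parallelism (or passing an explicit numPartitions to join) makes each shuffle task handle a smaller slice of the data, which reduces the chance that any single task exhausts its share of execution memory and spills to disk.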