Spark: coalesce is very slow even though the output data is very small

时光取名叫无心

I have the following code in Spark:

myData.filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .coalesce(1)
      .saveAsTextFile("myOutput")

1 Answer

情歌与酒

2020-12-09 05:44
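If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). The reason is that coalesce without a shuffle is a narrow transformation, so the upstream filter and map run inside the same single task as the write. To avoid this, you can pass shuffle = true. This adds a shuffle step, but it means the upstream partitions are still executed in parallel (per whatever the current partitioning is), and only the small filtered output is moved into the final partition.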

Note: With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.
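
To make the note concrete, here is a minimal sketch that compares partition counts in local mode. The object name, the `local[4]` master, the app name and the sample data are all assumptions for illustration and are not part of the original question.

```scala
import org.apache.spark.sql.SparkSession

object CoalescePartitionCounts {
  // Returns the partition counts of: the input RDD,
  // coalesce(1000) without a shuffle, and coalesce(1000, shuffle = true).
  def counts(spark: SparkSession): (Int, Int, Int) = {
    // Hypothetical sample data: 10,000 integers spread over 100 partitions.
    val rdd = spark.sparkContext.parallelize(1 to 10000, numSlices = 100)

    // Without a shuffle, coalesce can only merge partitions, never split them.
    val noShuffle = rdd.coalesce(1000)

    // With a shuffle, records are redistributed into the requested number of partitions.
    val withShuffle = rdd.coalesce(1000, shuffle = true)

    (rdd.getNumPartitions, noShuffle.getNumPartitions, withShuffle.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")           // assumed local test setup
      .appName("coalesce-counts")   // hypothetical app name
      .getOrCreate()
    try println(counts(spark)) finally spark.stop()
  }
}
```

The count only grows in the shuffled variant, which is why the note says that increasing the number of partitions requires shuffle = true.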

So try passing shuffle = true to the coalesce function, i.e.:
```
myData.filter(_.getMyEnum == null)
      .map(_.toString)
      .coalesce(1, shuffle = true)
      .saveAsTextFile("myOutput")
```
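
One way to check that the shuffle really separates the upstream work from the final single partition is to print the RDD lineage with `toDebugString`. The sketch below is only an illustration: the object name, the local session and the small string dataset standing in for `myData` are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceLineage {
  // Returns the lineage of coalesce(1) and of coalesce(1, shuffle = true).
  def lineages(spark: SparkSession): (String, String) = {
    // Hypothetical stand-in for myData: a few strings over 4 partitions.
    val data = spark.sparkContext.parallelize(Seq("a", "", "b", ""), numSlices = 4)
    val filtered = data.filter(_.isEmpty).map(_.toString)

    // Narrow coalesce: filter and map are pulled into the single output task.
    val narrow = filtered.coalesce(1)

    // Shuffled coalesce: filter and map keep their 4 tasks, then results are shuffled.
    val shuffled = filtered.coalesce(1, shuffle = true)

    (narrow.toDebugString, shuffled.toDebugString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")            // assumed local test setup
      .appName("coalesce-lineage")   // hypothetical app name
      .getOrCreate()
    try {
      val (narrow, shuffled) = lineages(spark)
      println(narrow)
      println(shuffled)
    } finally spark.stop()
  }
}
```

Only the second lineage should contain a ShuffledRDD, which marks the stage boundary. Note that `repartition(1)` is defined as `coalesce(1, shuffle = true)`, so either spelling gives the same plan.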