Spark Coalesce More Partitions

Submitted by 大憨熊 on 2019-12-11 05:18:01

Question


I have a spark job that processes a large amount of data and writes the results to S3. During processing I might have in excess of 5000 partitions. Before I write to S3 I want to reduce the number of partitions since each partition is written out as a file.

In other cases I may have only 50 partitions during processing. If I coalesce rather than repartition for performance reasons, what happens?

The docs say coalesce should only be used when the number of output partitions is less than the input, but what happens when it isn't? It doesn't seem to cause an error. Does it make the data incorrect or cause performance problems?

I am trying to avoid having to check my RDD's partition count to determine whether it exceeds my output limit and, only if so, coalesce. A sketch of what I mean is below.
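For reference, a minimal sketch of that guard (Scala; capPartitions and outputLimit are hypothetical names introduced here for illustration, not from the question):

    import org.apache.spark.rdd.RDD

    // Hypothetical helper showing the guard being avoided: inspect the
    // partition count and only coalesce when it exceeds the cap.
    def capPartitions[T](rdd: RDD[T], outputLimit: Int): RDD[T] =
      if (rdd.getNumPartitions > outputLimit) rdd.coalesce(outputLimit)
      else rdd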


Answer 1:


With the default PartitionCoalescer, if the requested number of partitions is larger than the current number of partitions and you don't set shuffle to true, the number of partitions stays unchanged.

coalesce with shuffle set to true, on the other hand, is equivalent to repartition with the same value of numPartitions.
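A minimal sketch illustrating both cases (assuming a local SparkSession; the 50/200 partition counts are arbitrary example values):

    import org.apache.spark.sql.SparkSession

    object CoalesceDemo {
      def main(args: Array[String]): Unit = {
        // Assumed local setup, for illustration only.
        val spark = SparkSession.builder()
          .appName("coalesce-demo")
          .master("local[4]")
          .getOrCreate()
        val sc = spark.sparkContext

        // An RDD with 50 partitions, asked to "grow" to 200.
        val rdd = sc.parallelize(1 to 1000, 50)

        // shuffle defaults to false: coalesce can only merge existing
        // partitions, so asking for more leaves the count unchanged.
        println(rdd.coalesce(200).getNumPartitions)                 // 50

        // shuffle = true: equivalent to repartition(200).
        println(rdd.coalesce(200, shuffle = true).getNumPartitions) // 200

        spark.stop()
      }
    }

So requesting more partitions than you have does not error out or corrupt data; without a shuffle it is simply a no-op on the partition count.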



Source: https://stackoverflow.com/questions/37597798/spark-coalesce-more-partitions
