Spark Coalesce More Partitions
问题 I have a spark job that processes a large amount of data and writes the results to S3. During processing I might have in excess of 5000 partitions. Before I write to S3 I want to reduce the number of partitions since each partition is written out as a file. In some other cases I may only have 50 partitions during processing. If I wanted to coalesce rather than repartition for performance reasons what would happen. From the docs it says coalesce should only be used if the number of output