So, I understand that in general one should use coalesce() when:
the number of partitions decreases due to a filter or some other operation that may reduce the original dataset.
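For example, a pattern like this (a minimal sketch; the paths and column names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-example").getOrCreate()
import spark.implicits._

// A selective filter leaves the surviving rows spread thinly across
// the original partitions.
val events = spark.read.parquet("/data/events")
val errors = events.filter($"level" === "ERROR")

// coalesce() merges the many near-empty partitions without a full shuffle.
errors.coalesce(8).write.parquet("/data/errors")
```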
In practice the optimal number of partitions depends more on the data you have, the transformations you use, and the overall configuration than on the available resources.
If you use actions that collect results on the driver (for example reduce, in contrast to treeReduce), a large number of partitions results in a higher load on the driver.

You can find a number of rules which suggest oversubscribing partitions compared to the number of cores (a factor of 2 or 3 seems to be common) or keeping partitions at a certain size, but this doesn't take into account your own code.
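To make the driver-load point concrete, here is a minimal sketch contrasting the two actions (the partition count and tree depth are arbitrary choices, not recommendations):

```scala
// reduce() sends every partition's partial result straight to the driver;
// treeReduce() first merges them on executors in a tree pattern.
val rdd = spark.sparkContext.range(0L, 100000000L, numSlices = 2000)

val totalFlat = rdd.reduce(_ + _)                // 2000 partials land on the driver
val totalTree = rdd.treeReduce(_ + _, depth = 3) // most merging stays on executors
```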
In my opinion:
Don't try to use a fixed number of partitions based on the number of executors or cores. First understand your data and code, then adjust the configuration to reflect your understanding.
Usually, it is relatively easy to determine the amount of raw data per partition for which your cluster exhibits stable behavior (in my experience it is somewhere in the range of a few hundred megabytes, depending on the format, the data structure you use to load the data, and the configuration). This is the "magic number" you're looking for.
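A rough sketch of turning that magic number into a partition count (the 256 MB target and the input path are assumptions to tune for your own cluster, and on-disk size is only a proxy for raw data size):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Per-partition target established empirically for this cluster.
val targetBytesPerPartition = 256L * 1024 * 1024

// Size the input on disk and derive the partition count from it.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val inputBytes = fs.getContentSummary(new Path("/data/events")).getLength
val numPartitions = math.max(1L, inputBytes / targetBytesPerPartition).toInt

val df = spark.read.parquet("/data/events").repartition(numPartitions)
```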
Some things you have to remember in general:
The number of partitions doesn't necessarily reflect the data distribution. Any operation that requires a shuffle (*byKey, join, RDD.partitionBy, Dataset.repartition) can result in non-uniform data distribution. Always monitor your jobs for symptoms of a significant data skew.

Some operations (union, coGroup, join) can affect the number of partitions.
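One quick way to check for skew is to count records per partition (purely illustrative; for real jobs the Spark UI's task metrics tell the same story):

```scala
import org.apache.spark.sql.functions.{col, spark_partition_id}

// A handful of partitions far larger than the rest is a skew symptom.
df.groupBy(spark_partition_id().alias("partition"))
  .count()
  .orderBy(col("count").desc)
  .show(10)
```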