According to Learning Spark
Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of
For someone who had issues generating a single csv file from PySpark (AWS EMR) as an output and saving it on s3, using repartition helped. The reason being, coalesce cannot do a full shuffle, but repartition can. Essentially, you can increase or decrease the number of partitions using repartition, but can only decrease the number of partitions (but not 1) using coalesce. Here is the code for anyone who is trying to write a csv from AWS EMR to s3:
df.repartition(1).write.format('csv')\
.option("path", "s3a://my.bucket.name/location")\
.save(header = 'true')