Spark - repartition() vs coalesce()

误落风尘  2020-11-22 17:11

According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.

14 answers
  •  忘掉有多难
    2020-11-22 17:31

    For anyone who had trouble generating a single CSV file as output from PySpark (on AWS EMR) and saving it to S3, using repartition helped. The reason is that coalesce avoids a full shuffle, whereas repartition performs one. Essentially, repartition lets you increase or decrease the number of partitions, while coalesce can only decrease it (coalescing down to a single partition is allowed, but it concentrates all the data on one executor). Here is the code for anyone trying to write a CSV from AWS EMR to S3:

    df.repartition(1).write \
        .format('csv') \
        .option('header', 'true') \
        .save('s3a://my.bucket.name/location')
    
