Spark - repartition() vs coalesce()

Asked 2020-11-22 17:11 by 误落风尘

According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
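
As a rough illustration of that difference (the session setup, partition counts, and toy data below are assumptions for the sketch, not from the book): repartition() performs a full shuffle and can raise or lower the partition count, while coalesce() only merges existing partitions and therefore avoids a full shuffle.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionVsCoalesce {
  def main(args: Array[String]): Unit = {
    // Demo session; the app name is an arbitrary placeholder.
    val spark = SparkSession.builder().appName("repartition-vs-coalesce").getOrCreate()

    // A toy DataFrame with a single "id" column.
    val df = spark.range(0, 1000000).toDF("id")

    // repartition() does a full shuffle; it can increase or decrease the
    // partition count and spreads rows roughly evenly across the result.
    val shuffled = df.repartition(200)

    // coalesce() only merges existing partitions, so it avoids a full
    // shuffle but can only reduce the count.
    val merged = df.coalesce(10)

    println(shuffled.rdd.getNumPartitions) // 200
    println(merged.rdd.getNumPartitions)   // 10 (or fewer if the input had fewer)

    spark.stop()
  }
}
```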

14 Answers

说谎 (OP) · 2020-11-22 17:36

To all the great answers I would like to add that repartition() is one of the best options for taking advantage of data parallelism, while coalesce() gives a cheap way to reduce the number of partitions; it is very useful when writing data to HDFS or some other sink, since it lets you take advantage of fewer, larger writes.

I have found this helpful when writing data in Parquet format, to get the full benefit (see the sketch below).
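
A minimal sketch of that pattern, under assumptions: the helper name, the partition count (16), and the HDFS output path are illustrative, not taken from the answer above. Coalescing just before the write means the sink receives fewer, larger Parquet files instead of one small file per upstream partition.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object ParquetCompaction {
  // Reduce the number of output files before writing to the sink.
  def writeCompacted(df: DataFrame): Unit = {
    df.coalesce(16)                                  // cheap merge, avoids a full shuffle
      .write
      .mode(SaveMode.Overwrite)
      .parquet("hdfs:///data/output/events_parquet") // fewer, larger Parquet files on HDFS
  }
}
```

If the upstream partitions are heavily skewed, a repartition(16) at the same spot trades the extra shuffle for more evenly sized output files.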
