Spark - repartition() vs coalesce()

误落风尘 2020-11-22 17:11

According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of

14 answers
  •  猫巷女王i
    2020-11-22 17:39

    Repartition: Shuffle the data into a NEW number of partitions.

    E.g., the initial DataFrame is split into 200 partitions.

    df.repartition(500): Data will be shuffled from the 200 partitions into 500 new partitions.

    Coalesce: Merge the data into a smaller number of EXISTING partitions, avoiding a full shuffle.

    df.coalesce(5): Data from the remaining 195 partitions will be moved into 5 existing partitions.
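To make the difference concrete, here is a toy model in plain Python. It is only a sketch of the idea, not Spark's implementation: partitions are modeled as lists of row ids, and the names toy_repartition and toy_coalesce are made up for illustration. Spark's real coalesce also takes data locality into account when choosing which partitions to merge. On a real DataFrame you could check the resulting count with df.rdd.getNumPartitions().

```python
from typing import List

Partitions = List[List[int]]


def toy_repartition(parts: Partitions, n: int) -> Partitions:
    """Toy full shuffle: every row is reassigned round-robin to one of n NEW partitions."""
    rows = [row for part in parts for row in part]
    new_parts: Partitions = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        new_parts[i % n].append(row)
    return new_parts


def toy_coalesce(parts: Partitions, n: int) -> Partitions:
    """Toy merge: keep the first n partitions and fold the remaining ones into them."""
    if n >= len(parts):
        # Like Spark's coalesce, this toy version never increases the partition count.
        return [list(p) for p in parts]
    new_parts: Partitions = [list(p) for p in parts[:n]]  # rows here do not move
    for i, part in enumerate(parts[n:]):
        new_parts[i % n].extend(part)  # whole partitions are appended, not split
    return new_parts


# 200 partitions of 10 rows each, mirroring the example above (row ids are arbitrary).
parts = [[p * 10 + r for r in range(10)] for p in range(200)]

repartitioned = toy_repartition(parts, 500)  # all 2000 rows are redistributed
coalesced = toy_coalesce(parts, 5)           # 195 partitions are folded into 5

print(len(repartitioned), len(coalesced))
```

In this model, the rows that started in the 5 kept partitions never move under toy_coalesce, while toy_repartition reassigns every row. That is the intuition behind coalesce being the cheaper call when you only need fewer partitions.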
