Spark - repartition() vs coalesce()

误落风尘 2020-11-22 17:11

According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
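
For reference, a minimal sketch of the two calls being compared (the local session and the partition counts are illustrative assumptions, not from the book):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[4]").appName("repartition-vs-coalesce").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 100, 4)

    // repartition() can increase or decrease the partition count, but always performs a full shuffle
    val reshuffled = rdd.repartition(10)

    // coalesce() can only decrease the partition count; it merges existing partitions and avoids a full shuffle
    val merged = rdd.coalesce(2)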

14 Answers
时光取名叫无心 2020-11-22 17:32

    It avoids a full shuffle. If the number of partitions is known to be decreasing, the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes onto the nodes that are kept.

    So, it would go something like this:

    Node 1 = 1,2,3
    Node 2 = 4,5,6
    Node 3 = 7,8,9
    Node 4 = 10,11,12
    

    Then coalesce down to 2 partitions:

    Node 1 = 1,2,3 + (10,11,12)
    Node 3 = 7,8,9 + (4,5,6)
    

    Notice that Node 1 and Node 3 did not require their original data to move; only the partitions from Node 2 and Node 4 are transferred.
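
    A minimal sketch of the same idea in code (the local master and the 12-element RDD are illustrative assumptions, mirroring the 4-node example above):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().master("local[4]").appName("coalesce-demo").getOrCreate()
        val sc = spark.sparkContext

        // 12 elements in 4 partitions, like the 4 nodes above
        val rdd = sc.parallelize(1 to 12, 4)

        // coalesce merges existing partitions without a full shuffle
        val merged = rdd.coalesce(2)

        // glom() exposes the contents of each partition, so the merge is visible
        merged.glom().collect().foreach(p => println(p.mkString(",")))

        // repartition(2) would give the same partition count, but via a full shuffle
        // (repartition(n) is equivalent to coalesce(n, shuffle = true))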
