Spark - repartition() vs coalesce()

误落风尘 2020-11-22 17:11

According to Learning Spark:

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
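
To make the contrast concrete, here is a minimal sketch (assumed to run in spark-shell, where sc is the SparkContext; the data and partition counts are invented for illustration):

    // Build a small RDD with 8 partitions (numbers are arbitrary for the example).
    val rdd = sc.parallelize(1 to 100, 8)

    // repartition() can grow or shrink the partition count, but always does a full shuffle.
    val wider = rdd.repartition(16)

    // coalesce() merges existing partitions, avoiding a shuffle when decreasing the count.
    val narrower = rdd.coalesce(2)

    println(wider.partitions.length)    // 16
    println(narrower.partitions.length) // 2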

14 Answers
  •  温柔的废话
    2020-11-22 17:15

    One additional point to note: because Spark RDDs are immutable, repartition and coalesce each return a new RDD, while the base RDD continues to exist with its original number of partitions. If the use case requires the RDD to be persisted in cache, the same must be done for the newly created RDD.

    scala> pairMrkt.repartition(10)
    res16: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[11] at repartition at <console>:26

    scala> res16.partitions.length
    res17: Int = 10

    scala> pairMrkt.partitions.length
    res20: Int = 2
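
    Building on this, a hedged sketch of the caching point above (pairMrkt is the answer's RDD; the storage level is an arbitrary choice for the example):

    import org.apache.spark.storage.StorageLevel

    // repartition() returns a new RDD; the original pairMrkt is untouched.
    val repartitioned = pairMrkt.repartition(10)

    // Persisting must target the new RDD; caching pairMrkt would not cache
    // the 10-partition RDD, since each is a distinct, immutable RDD.
    repartitioned.persist(StorageLevel.MEMORY_ONLY)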
    
