repartition() is not affecting RDD partition size

陌路散爱 提交于 2021-02-18 12:17:07

问题


I am trying to change partition size of an RDD using repartition() method. The method call on the RDD succeeds, but when I explicitly check the partition size using partition.size property of the RDD, I get back the same number of partitions that it originally had:-

scala> rdd.partitions.size
res56: Int = 50

scala> rdd.repartition(10)
res57: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at repartition at <console>:27

At this stage I perform some action like rdd.take(1) just to force evaluation, just in case if that matters. And then I again check the partition size:-

scala> rdd.partitions.size
res58: Int = 50

As one can see, it's not changing. Can someone answer why?


回答1:


First, it does matter that you run an action as repartition is indeed lazy. Second, repartition returns a new RDD with the partitioning changed, so you must use the returned RDD or else you are still working off of the old partitioning. Finally, when shrinking your partitions, you should use coalesce, as that will not reshuffle the data. It will instead keep data on the number of nodes and pull in the remaining orphans.



来源:https://stackoverflow.com/questions/31508345/repartition-is-not-affecting-rdd-partition-size

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!