How to get the number of elements in a partition?

清歌不尽 2020-12-05 11:14

Is there any way to get the number of elements in a Spark RDD partition, given the partition ID, without scanning the entire partition?

Something like this:

3 Answers
  •  猫巷女王i
    2020-12-05 11:47

    pzecevic's answer works, but conceptually there's no need to construct an array and then convert it to an iterator. I would just construct the iterator directly and then get the counts with a collect call.

    rdd.mapPartitions(iter => Iterator(iter.size), true).collect()
    

    P.S. Not sure if his answer is actually doing more work since Iterator.apply will likely convert its arguments into an array.
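
For the case in the question, where only one partition ID is of interest, here is a minimal sketch that restricts the job to that partition with `SparkContext.runJob`. The object and method names (`PartitionCounts`, `countsPerPartition`, `countPartition`) are hypothetical and only for illustration. Note that, as far as I know, an RDD does not store per-partition element counts, so this still iterates over the elements of the chosen partition; it only avoids computing the other partitions.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object PartitionCounts {

  // Count every partition in a single job; returns (partitionId, count) pairs.
  def countsPerPartition[T](rdd: RDD[T]): Array[(Int, Int)] =
    rdd.mapPartitionsWithIndex(
      (idx, iter) => Iterator((idx, iter.size)),
      preservesPartitioning = true
    ).collect()

  // Count one partition only: runJob launches a task just for the given ID.
  // The iterator of that partition is still consumed to compute its size.
  def countPartition[T](rdd: RDD[T], partitionId: Int): Int =
    rdd.sparkContext
      .runJob(rdd, (iter: Iterator[T]) => iter.size, Seq(partitionId))
      .head
}
```

Usage would look like `PartitionCounts.countPartition(rdd, 3)`, assuming `rdd` has at least four partitions. If counts for all partitions are needed anyway, `countsPerPartition` does the same work as the one-liner above but keeps the partition IDs alongside the counts.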
