Is there any way to get the number of elements in a Spark RDD partition, given the partition ID, without scanning the entire partition?
Something like this:
pzecevic's answer works, but conceptually there's no need to construct an array and then convert it to an iterator. I would just construct the iterator directly and then get the counts with a collect call.
rdd.mapPartitions(iter => Iterator(iter.size), preservesPartitioning = true).collect()
P.S. I'm not sure his answer actually does more work, since Iterator.apply takes varargs and will likely wrap its arguments in an array anyway.
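If you want the count for one specific partition ID rather than all of them, something along these lines should work. This is only a sketch: `PartitionCounts`, `countsByPartition` and `countForPartition` are names I made up for illustration, and it relies on `mapPartitionsWithIndex` to tag each count with its partition index. Note that it still iterates over the target partition; as far as I know an RDD does not keep per-partition element counts around, so some scan of that partition seems unavoidable.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object PartitionCounts {

  // (partitionId, elementCount) for every partition of the RDD.
  def countsByPartition[T](rdd: RDD[T]): Array[(Int, Int)] =
    rdd.mapPartitionsWithIndex(
      (idx, iter) => Iterator((idx, iter.size)),
      preservesPartitioning = true
    ).collect()

  // Count for a single partition. Only the iterator of the matching
  // partition is consumed; the others return an empty iterator untouched.
  // Returns None if no partition has the given ID.
  def countForPartition[T](rdd: RDD[T], partitionId: Int): Option[Int] =
    rdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == partitionId) Iterator(iter.size) else Iterator.empty
    }.collect().headOption
}
```

Usage would be something like `PartitionCounts.countForPartition(rdd, 2)`, which gives `Some(n)` for an existing partition and `None` otherwise.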