How to get the number of elements in a partition?
Question: Is there any way to get the number of elements in a Spark RDD partition, given the partition ID, without scanning the entire partition? Something like this:

rdd.partitions().get(index).size()

I don't see such an API in Spark. Any ideas or workarounds? Thanks.

Answer 1: The following gives you a new RDD whose elements are the sizes of the corresponding partitions:

rdd.mapPartitions(iter => Array(iter.size).iterator, true)

(The second argument sets preservesPartitioning = true, telling Spark the existing partitioner still applies.)

Answer 2 (PySpark):

num_partitions = 20000
a = sc.parallelize(range(int(1e6)), num_partitions)
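The source cuts Answer 2 off after the parallelize call. Below is a minimal sketch of how such an answer typically continues, assuming the common glom()-based pattern for measuring partition sizes; glom(), mapPartitionsWithIndex(), and collectAsMap() are real PySpark RDD methods, but the continuation itself is a reconstruction, not the original answer's text.

# glom() collects each partition into a list; map(len) then yields each partition's size
sizes = a.glom().map(len).collect()
print(sizes[0])  # number of elements in partition 0

# Alternative that avoids materializing whole partitions as lists:
# pair each partition ID with a streamed count of its elements
size_by_id = a.mapPartitionsWithIndex(
    lambda idx, it: [(idx, sum(1 for _ in it))]
).collectAsMap()
print(size_by_id[0])  # size of the partition with ID 0

Note that every approach shown here, including Answer 1's, still scans each partition once: an RDD does not store per-partition element counts, so counting requires evaluating the partition's contents.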