Spark: Find Each Partition Size for RDD

你离开我真会死。 submitted on 2019-12-03 04:36:32

Question


What's the best way of finding each partition size for a given RDD? I'm trying to debug a skewed partition issue, and I've tried this:

l = builder.rdd.glom().map(len).collect()  # get length of each partition
print('Min Partition Size: ', min(l), '. Max Partition Size: ', max(l), '. Avg Partition Size: ', sum(l)/len(l), '. Total Partitions: ', len(l))

It works fine for small RDDs, but for bigger RDDs it gives an OOM error. My guess is that glom() is causing this. Anyway, I just wanted to know if there is a better way to do it.


Answer 1:


Use:

builder.rdd.mapPartitions(lambda it: [sum(1 for _ in it)])
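This returns an RDD with one element per partition (its row count), so only a handful of integers reach the driver instead of the full partition contents that glom() materializes. As a minimal sketch (assuming builder is the same DataFrame as in the question), you can collect the counts and reproduce the same summary stats like this:

l = builder.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()  # one count per partition
print('Min Partition Size: ', min(l), '. Max Partition Size: ', max(l), '. Avg Partition Size: ', sum(l)/len(l), '. Total Partitions: ', len(l))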



Answer 2:


While the answer by @LostInOverflow works great, I've found another way to get the size as well as the index of each partition, using the code below. Thanks to this awesome post.

Here is the code:

l = test_join.rdd.mapPartitionsWithIndex(lambda x,it: [(x,sum(1 for _ in it))]).collect()

and then you can get the smallest and largest partitions using this code:

min(l, key=lambda item: item[1])
max(l, key=lambda item: item[1])

Once we have the index of the skewed partition, we can further debug the contents of that partition, if needed.
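For example, here is a minimal sketch (assuming the same test_join DataFrame and the list l collected above) that pulls a few sample rows from only the skewed partition, leaving all other partitions untouched:

import itertools

skew_idx = max(l, key=lambda item: item[1])[0]  # index of the largest partition

# Emit a small sample from the skewed partition only; every other partition yields nothing.
sample = test_join.rdd.mapPartitionsWithIndex(
    lambda idx, it: itertools.islice(it, 5) if idx == skew_idx else iter([])
).collect()
print(sample)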



Source: https://stackoverflow.com/questions/41068112/spark-find-each-partition-size-for-rdd
