Spark: how can I evenly distribute my records across all partitions?

难免孤独  2021-01-06 14:16

I have an RDD with 30 records (key/value pairs: the key is a timestamp and the value is a JPEG byte array),
and I am running 30 executors. I want to repartition this RDD into 30 partitions so that each partition holds exactly one record.

2 Answers
  •  Happy的楠姐
    2021-01-06 14:36

    You can force a new partitioning by using the partitionBy command and providing the number of partitions. By default the partitioner is hash-based, but you can switch to a range-based one for a better distribution. If you really want to force an even repartitioning, you can use a random number generator as the partition function (in PySpark):

    import numpy as np  # needed for the random draw below

    my_rdd.partitionBy(pCount, partitionFunc=lambda x: np.random.randint(pCount))


    This will, however, frequently cause an inefficient shuffle (lots of data transferred between nodes), but if your process is compute-bound it can make sense.
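
    For illustration, here is a minimal, self-contained sketch of this approach, assuming a local SparkContext; the timestamps and byte payloads are hypothetical placeholders, and the names pCount and my_rdd match the snippet above. It also shows a deterministic index-based partition function for the case where you want exactly one record per partition, which the random approach does not guarantee:

    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext("local[4]", "even-partition-demo")

    pCount = 30
    # 30 (timestamp, payload) pairs standing in for the real (timestamp, JPEG) data.
    records = [(1609900000 + i, bytes([i])) for i in range(pCount)]
    my_rdd = sc.parallelize(records)

    # Random partition function: statistically even, but a partition may end up
    # with zero or two records, and it always triggers a full shuffle.
    randomized = my_rdd.partitionBy(pCount,
                                    partitionFunc=lambda k: np.random.randint(pCount))

    # Deterministic alternative: key each record by its position so that record i
    # lands exactly in partition i (PySpark applies the partition function's
    # result modulo the partition count).
    one_per_partition = (my_rdd.zipWithIndex()               # ((ts, jpeg), i)
                               .map(lambda p: (p[1], p[0]))  # (i, (ts, jpeg))
                               .partitionBy(pCount, partitionFunc=lambda i: i))

    print(randomized.glom().map(len).collect())         # sizes fluctuate around 1
    print(one_per_partition.glom().map(len).collect())  # [1, 1, ..., 1]

    Running the second variant places one record in each of the 30 partitions, so with 30 executors each executor can process exactly one image.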
