Spark: how can I evenly distribute my records across all partitions?

难免孤独 2021-01-06 14:16

I have an RDD with 30 records (key/value pairs: the key is a timestamp and the value is a JPEG byte array), and I am running 30 executors. I want to repartition this RDD into 30 partitions so that the records are evenly distributed across them.

2 Answers
  • 2021-01-06 14:30

    The salting technique can be used: it involves adding a new "fake" key and using it alongside the current key to distribute the data more evenly (see the sketch below).

    (here is a link on salting)
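
    A minimal sketch of salting in PySpark, assuming a pair RDD my_rdd of (timestamp, jpeg_bytes) and a target of 30 partitions; the names num_partitions, salted and evenly_spread are illustrative:

    import random

    num_partitions = 30

    # Append a random "salt" component to every key so the hash partitioner
    # spreads the records across all partitions instead of clustering them.
    salted = my_rdd.map(lambda kv: ((kv[0], random.randint(0, num_partitions - 1)), kv[1]))

    # Partition on the salted key, then drop the salt to restore the original keys.
    evenly_spread = (salted
                     .partitionBy(num_partitions)
                     .map(lambda kv: (kv[0][0], kv[1])))

    Because the salt is random, the spread is only even on average; with just 30 records a few partitions may still hold two records while others hold none.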

  • 2021-01-06 14:36

    You can force a new partitioning by using the partitionBy command and providing a number of partitions. By default the partitioner is hash-based, but you can switch to a range-based one for a better distribution. If you really want to force a repartitioning, you can use a random number generator as the partition function (in PySpark):

    import numpy as np

    # Route each record to a randomly chosen partition in [0, pCount).
    my_rdd.partitionBy(pCount, partitionFunc=lambda x: np.random.randint(pCount))

    This will, however, frequently cause inefficient shuffles (lots of data transferred between nodes), but if your process is compute-limited it can make sense.
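
    As a quick check (a sketch, assuming the same my_rdd and pCount as above), you can inspect how many records land in each partition with glom():

    import numpy as np

    repartitioned = my_rdd.partitionBy(pCount, partitionFunc=lambda x: np.random.randint(pCount))

    # glom() turns each partition into a list, so mapping len() over it
    # gives the per-partition record counts.
    sizes = repartitioned.glom().map(len).collect()
    print(sizes)  # random assignment is even on average, not exactly one record per partition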
