I have an RDD with 30 records (key/value pairs: the key is a timestamp and the value is a JPEG byte array) and I am running 30 executors. I want to repartition this RDD into 30 partitions.
One option is the salting technique, which involves adding a new "fake" key and using it alongside the current key to distribute the data more evenly.
(here is a link for salting)
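A minimal sketch of the salting idea in plain Python (no Spark needed), assuming a hypothetical dataset where all 30 records share one hot key: appending a random salt to the key makes records hash to different partitions instead of piling into one.

```python
import random

n_partitions = 30
# Hypothetical skewed data: one hot key repeated 30 times.
records = [("2021-01-01T00:00:00", b"...jpeg...")] * 30

# Without salting: every record hashes to the same partition.
plain = {hash(k) % n_partitions for k, _ in records}

# With salting: pair the key with a random salt before hashing,
# so identical keys land in different partitions.
random.seed(0)
salted = {hash((k, random.randrange(n_partitions))) % n_partitions
          for k, _ in records}
```

Here `plain` contains exactly one partition index, while `salted` spans many; in Spark you would carry the salt in the key (e.g. `(key, salt)`) and strip it after the shuffle.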
You can force a new partitioning by using the partitionBy command and providing a number of partitions. By default the partitioner is hash-based, but you can switch to a range-based partitioner for a better distribution. If you really want to force a repartitioning, you can use a random number generator as the partition function (in PySpark):
import numpy as np

# Assign each record to a random partition, ignoring the key entirely.
my_rdd.partitionBy(pCount, partitionFunc=lambda x: np.random.randint(pCount))
This will, however, frequently cause inefficient shuffles (lots of data transferred between nodes), but if your process is compute-bound it can make sense.
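To see what a random partition function does, here is a Spark-free sketch that simulates the assignment for 30 hypothetical keys: every record is placed independently, so the data spreads out, though not necessarily exactly one record per partition.

```python
import random
from collections import Counter

pCount = 30
keys = [f"ts-{i}" for i in range(30)]  # hypothetical timestamp keys

random.seed(42)
# Each key gets an independent uniform-random partition index,
# mirroring partitionFunc=lambda x: np.random.randint(pCount).
assignment = Counter(random.randrange(pCount) for _ in keys)

print(sum(assignment.values()))  # 30 records placed in total
print(len(assignment))           # number of non-empty partitions
```

If you need exactly one record per partition, a deterministic alternative is to `zipWithIndex` and partition by `index % pCount` instead of a random draw.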