Spark: how can I evenly distribute my records across all partitions?

难免孤独 2021-01-06 14:16

I have an RDD with 30 records (key/value pairs: the key is a timestamp and the value is a JPEG byte array), and I am running 30 executors. I want to repartition this RDD into 30 partitions so that the records are evenly distributed across them.

2 Answers
  • 2021-01-06 14:30

    The salting technique can be used: it involves adding a new "fake" key and using it alongside the current key to distribute the data more evenly (see the sketch below).

    (here is a link on salting)
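
    A minimal sketch of salting in PySpark, assuming a pair RDD my_rdd of (timestamp, jpeg_bytes) and a target of 30 partitions; the names num_partitions, salted and evenly_spread are illustrative:

    import random

    num_partitions = 30

    # Append a random "salt" component to every key so the hash partitioner
    # spreads the records across all partitions instead of clustering them.
    salted = my_rdd.map(lambda kv: ((kv[0], random.randint(0, num_partitions - 1)), kv[1]))

    # Partition on the salted key, then drop the salt to restore the original keys.
    evenly_spread = (salted
                     .partitionBy(num_partitions)
                     .map(lambda kv: (kv[0][0], kv[1])))

    Because the salt is random, the spread is only even on average; with just 30 records a few partitions may still hold two records while others hold none.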

  • 2021-01-06 14:36

    You can force a new partitioning by using the partitionBy command and providing a number of partitions. By default the partitioner is hash-based, but you can switch to a range-based one for a better distribution. If you really want to force a repartitioning, you can use a random number generator as the partition function (in PySpark):

    import numpy as np

    # Route each record to a randomly chosen partition in [0, pCount).
    my_rdd.partitionBy(pCount, partitionFunc=lambda x: np.random.randint(pCount))

    This will, however, frequently cause inefficient shuffles (lots of data transferred between nodes), but if your process is compute-limited it can make sense.
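
    As a quick check (a sketch, assuming the same my_rdd and pCount as above), you can inspect how many records land in each partition with glom():

    import numpy as np

    repartitioned = my_rdd.partitionBy(pCount, partitionFunc=lambda x: np.random.randint(pCount))

    # glom() turns each partition into a list, so mapping len() over it
    # gives the per-partition record counts.
    sizes = repartitioned.glom().map(len).collect()
    print(sizes)  # random assignment is even on average, not exactly one record per partition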
