Spark: how can I evenly distribute my records across all partitions?

难免孤独  2021-01-06 14:16

I have an RDD with 30 records (key/value pairs: the key is a timestamp and the value is a JPEG byte array),
and I am running 30 executors. I want to repartition this RDD into 30 partitions so that each partition holds exactly one record.

2 Answers
  •  Happy的楠姐
    2021-01-06 14:36

    You can force a new partitioning by using the partitionBy command and providing the number of partitions. By default the partitioner is hash-based, but you can switch to a range-based one for a better distribution. If you really want to force an even repartitioning, you can use a random number generator as the partition function (in PySpark):

    import numpy as np  # needed for the random draw below

    my_rdd.partitionBy(pCount, partitionFunc=lambda x: np.random.randint(pCount))


    This will, however, frequently cause an inefficient shuffle (lots of data transferred between nodes), but if your process is compute-bound it can make sense.
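
    For illustration, here is a minimal, self-contained sketch of this approach, assuming a local SparkContext; the timestamps and byte payloads are hypothetical placeholders, and the names pCount and my_rdd match the snippet above. It also shows a deterministic index-based partition function for the case where you want exactly one record per partition, which the random approach does not guarantee:

    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext("local[4]", "even-partition-demo")

    pCount = 30
    # 30 (timestamp, payload) pairs standing in for the real (timestamp, JPEG) data.
    records = [(1609900000 + i, bytes([i])) for i in range(pCount)]
    my_rdd = sc.parallelize(records)

    # Random partition function: statistically even, but a partition may end up
    # with zero or two records, and it always triggers a full shuffle.
    randomized = my_rdd.partitionBy(pCount,
                                    partitionFunc=lambda k: np.random.randint(pCount))

    # Deterministic alternative: key each record by its position so that record i
    # lands exactly in partition i (PySpark applies the partition function's
    # result modulo the partition count).
    one_per_partition = (my_rdd.zipWithIndex()               # ((ts, jpeg), i)
                               .map(lambda p: (p[1], p[0]))  # (i, (ts, jpeg))
                               .partitionBy(pCount, partitionFunc=lambda i: i))

    print(randomized.glom().map(len).collect())         # sizes fluctuate around 1
    print(one_per_partition.glom().map(len).collect())  # [1, 1, ..., 1]

    Running the second variant places one record in each of the 30 partitions, so with 30 executors each executor can process exactly one image.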
