Spark: Is there any rule of thumb about the optimal number of partitions of an RDD and its number of elements?

Submitted by 六眼飞鱼酱① on 2019-12-04 06:05:52
zero323

There isn't, because it is highly dependent on the application, resources and data. There are some hard limitations (like various 2GB limits), but the rest you have to tune on a task-by-task basis. Some factors to consider:

  • size of a single row / element
  • cost of a typical operation. If you have small partitions and operations are cheap, then scheduling cost can be much higher than the cost of data processing (the sketch after this list shows one way to inspect partition sizes).
  • cost of processing a whole partition when performing partition-wise operations (sort, for example).
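
A minimal sketch of how to inspect partition count and per-partition element counts; the RDD below is purely illustrative data, so substitute your own:

// Inspect how elements are spread across partitions to judge whether
// partitions are so small that scheduling overhead dominates processing.
val rdd = sc.parallelize(1 to 1000000, numSlices = 200)  // illustrative data and partitioning
println(s"partitions: ${rdd.getNumPartitions}")

// One element count per partition; cheap enough for a quick sanity check.
val sizes = rdd.mapPartitions(iter => Iterator(iter.size)).collect()
println(s"min=${sizes.min}, max=${sizes.max}, avg=${sizes.sum / sizes.length}")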

If the core problem here is the number of initial files, then using some variant of CombineFileInputFormat could be a better idea than repartitioning / coalescing. For example:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat

// Read many small files as combined splits instead of one partition per file.
sc.hadoopFile(
  path,
  classOf[CombineTextInputFormat],
  classOf[LongWritable], classOf[Text]
).map(_._2.toString)
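
CombineTextInputFormat sizes its splits from the Hadoop configuration; if you want to cap how much data ends up in a single combined split, you can set the standard split-size property before reading. The key below is the usual Hadoop 2.x name and the 64 MB value is only an example, so verify both against your Hadoop version:

// Cap combined splits at roughly 64 MB (illustrative value).
// "mapreduce.input.fileinputformat.split.maxsize" is the standard Hadoop 2.x key.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize",
  64L * 1024 * 1024
)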

See also How to calculate the best numberOfPartitions for coalesce?

While I completely agree with zero323, you can still implement some kind of heuristic. Internally we took the size of the data (stored as compressed Avro key-value files) and computed the number of partitions so that no partition exceeds 64 MB (totalVolume / 64 MB ≈ number of partitions). Once in a while we run an automatic job to recompute the "optimal" number of partitions for each type of input. In our case this is easy to do since the inputs are on HDFS (S3 would probably work too).

Once again it depends on your computation and your data, so your number might be completely different.
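
A minimal sketch of that heuristic, assuming the input lives under a single HDFS (or S3) path and that sc, path and rdd already exist; the names and the 64 MB target are placeholders:

import org.apache.hadoop.fs.{FileSystem, Path}

// totalVolume / 64 MB ≈ number of partitions, as described above.
val targetBytesPerPartition = 64L * 1024 * 1024
val fs = FileSystem.get(sc.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(path)).getLength
val numPartitions = math.max(1, math.ceil(totalBytes.toDouble / targetBytesPerPartition).toInt)

// Reduce the partition count accordingly (pass shuffle = true if you need to increase it).
val repartitioned = rdd.coalesce(numPartitions)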
