Number of partitions in RDD and performance in Spark

Submitted by 痞子三分冷 on 2019-12-17 06:34:43

Question


In PySpark, I can create an RDD from a list and decide how many partitions to have:

from pyspark import SparkContext

sc = SparkContext()
sc.parallelize(range(0, 10), 4)  # range (Python 3) in place of the original xrange

How does the number of partitions I decide to partition my RDD into influence the performance? And how does this depend on the number of cores my machine has?
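For intuition, the slicing itself can be sketched in plain Python. This is an approximation of how parallelize divides a collection, not Spark's actual implementation, and slice_collection is a made-up name:

```python
def slice_collection(data, num_slices):
    """Roughly how a collection is split into partitions: slice i
    covers elements [i * n // num_slices, (i + 1) * n // num_slices)."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

print(slice_collection(list(range(10)), 4))
# [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```

Each of the 4 partitions becomes one task, so the partition count bounds how many tasks can run in parallel.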


Answer 1:


The main performance problems come from specifying either too few partitions or far too many partitions.

Too few partitions: you will not utilize all of the cores available in the cluster.

Too many partitions: there will be excessive overhead in managing many small tasks.

Between the two, the first is far more damaging to performance. For partition counts below about 1000, the cost of scheduling many small tasks is relatively minor. If you have on the order of tens of thousands of partitions, however, Spark gets very slow.
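Both failure modes can be shown with a toy cost model (pure Python; the function name, overhead constant, and workload numbers are all illustrative, not measurements of Spark):

```python
import math

def estimated_runtime(total_work_s, num_partitions, num_cores,
                      per_task_overhead_s=0.005):
    """Rough makespan model: work is split evenly across partitions,
    each task pays a fixed scheduling overhead, and tasks run in
    waves of `num_cores` at a time."""
    task_time = total_work_s / num_partitions + per_task_overhead_s
    waves = math.ceil(num_partitions / num_cores)
    return waves * task_time

cores = 8
for n in (2, 8, 32, 100_000):
    # 2 partitions leave 6 cores idle; 100,000 drown in per-task overhead
    print(n, estimated_runtime(100.0, n, cores))
```

In this model, 32 partitions finish the 100 s of work fastest: 2 partitions use only a quarter of the cores, while 100,000 partitions spend most of their time on scheduling overhead.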




Answer 2:


To add to javadba's excellent answer: I recall that the docs recommend setting the number of partitions to 3 or 4 times the number of CPU cores in your cluster, so that the work gets distributed more evenly among the available cores. With only 1 partition per CPU core, you have to wait for the single longest-running task to complete; broken down further, the workload is more evenly balanced, with fast- and slow-running tasks evening out.
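A rough simulation of that effect (pure Python with made-up task durations; this greedy scheduler only approximates how Spark assigns tasks to free cores):

```python
import heapq

def makespan(task_durations, num_cores):
    """Greedy scheduling: each core takes the next task as soon as it
    frees up. Returns the time at which the last core finishes."""
    cores = [0.0] * num_cores  # min-heap of per-core finish times
    heapq.heapify(cores)
    for d in task_durations:
        earliest = heapq.heappop(cores)
        heapq.heappush(cores, earliest + d)
    return max(cores)

# 40 units of total work on 4 cores; one coarse task is a straggler.
coarse = [16.0, 8.0, 8.0, 8.0]   # 1 partition per core
fine = [4.0] * 4 + [2.0] * 12    # same work split into 16 tasks
print(makespan(coarse, 4))  # 16.0 -- everyone waits on the straggler
print(makespan(fine, 4))    # 10.0 -- idle cores pick up extra tasks
```

With 4 coarse partitions the run takes as long as the single 16-unit straggler; with 16 finer partitions the same 40 units of work finish in 10.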




Answer 3:


The number of partitions has a high impact on Spark code performance. Ideally, the partition count should reflect how much data you shuffle. As a rule of thumb, set this parameter based on your shuffle size (shuffle read/write), aiming for 128 to 256 MB of data per partition to get maximum performance.

You can set the partition count in your Spark SQL code via the spark.sql.shuffle.partitions property, or on any DataFrame as below: df.repartition(numOfPartitions)
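That rule of thumb can be turned into a number with a small helper (suggest_shuffle_partitions is a hypothetical name; the division by a 128-256 MB target is the only logic):

```python
import math

def suggest_shuffle_partitions(shuffle_bytes, target_mb=200):
    """Suggest a partition count so each partition holds roughly
    `target_mb` MB of shuffle data (rule of thumb: 128-256 MB)."""
    target_bytes = target_mb * 1024 * 1024
    return max(1, math.ceil(shuffle_bytes / target_bytes))

# e.g. a 50 GB shuffle at ~200 MB per partition
n = suggest_shuffle_partitions(50 * 1024**3, target_mb=200)
print(n)  # 256

# Applied in PySpark (assuming a SparkSession `spark` and DataFrame `df`):
#   spark.conf.set("spark.sql.shuffle.partitions", n)
#   df = df.repartition(n)
```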



Source: https://stackoverflow.com/questions/35800795/number-of-partitions-in-rdd-and-performance-in-spark
