How does partitioning work in Spark?


Question


I'm trying to understand how partitioning is done in Apache Spark. Can you guys help please?

Here is the scenario:

  • a master and two nodes with 1 core each
  • a file count.txt of 10 MB in size

How many partitions does the following create?

rdd = sc.textFile("count.txt")

Does the size of the file have any impact on the number of partitions?


Answer 1:


By default, a partition is created for each HDFS block of the input file, which is 64 MB by default (see the Spark Programming Guide).

It's possible to pass another parameter, minPartitions, which sets the minimum number of partitions that Spark will create. If you don't pass it, Spark falls back to sc.defaultMinPartitions, which is derived from spark.default.parallelism.

Since spark.default.parallelism is supposed to be the number of cores across all of the machines in your cluster, I believe there would be at least 3 partitions created in your case.
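
To make this concrete, here is a minimal PySpark sketch (it assumes a SparkContext named sc and the count.txt file from the question) that checks how many partitions are actually created:

# Default: one partition per HDFS block, subject to sc.defaultMinPartitions
rdd = sc.textFile("count.txt")
print(rdd.getNumPartitions())

# Request a higher minimum explicitly via the minPartitions argument
rdd4 = sc.textFile("count.txt", minPartitions=4)
print(rdd4.getNumPartitions())   # at least 4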

You can also repartition or coalesce an RDD to change the number of partitions, which in turn influences the total amount of available parallelism.
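
For example (again assuming the rdd created above):

# repartition performs a full shuffle and can increase or decrease the partition count
more = rdd.repartition(8)
print(more.getNumPartitions())   # 8

# coalesce merges partitions without a full shuffle, so it can only decrease the count
fewer = more.coalesce(2)
print(fewer.getNumPartitions())  # 2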



Source: https://stackoverflow.com/questions/26368362/how-does-partitioning-work-in-spark
