How many partitions does Spark create when a file is loaded from an S3 bucket?

Submitted by 末鹿安然 on 2019-12-04 20:47:29

Question


If a file is loaded from HDFS, Spark by default creates one partition per HDFS block. But how does Spark decide the number of partitions when a file is loaded from an S3 bucket?


Answer 1:


See the code of org.apache.hadoop.mapred.FileInputFormat.getSplits().

The block size depends on the S3 filesystem implementation (see FileStatus.getBlockSize()). For example, S3AFileStatus simply sets it to 0, and then FileInputFormat.computeSplitSize() comes into play.
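
The split-size formula itself is tiny; here is a minimal Scala sketch of what FileInputFormat.computeSplitSize() does (the formula mirrors the Hadoop source, where goalSize is the total input size divided by the requested split count and minSize is the configured minimum split size):

// Mirrors org.apache.hadoop.mapred.FileInputFormat.computeSplitSize():
//   goalSize  - total input bytes / number of splits requested
//   minSize   - configured minimum split size
//   blockSize - FileStatus.getBlockSize() for the file
def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

// With blockSize reported as 0 (the S3AFileStatus case above),
// math.min(goalSize, 0) is 0 and the result collapses to minSize.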

Also, you don't get splits if your InputFormat is not splittable :)
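
As a concrete illustration of the non-splittable case: gzip is a non-splittable codec, so an entire .gz file becomes a single partition no matter its size. A quick check (the bucket and object names below are hypothetical):

// gzip is not splittable, so Spark cannot cut this file into splits:
// the whole object is read as one partition.
val gz = sc.textFile("s3a://my-bucket/logs/events.log.gz")  // hypothetical path
println(gz.partitions.length)  // 1, regardless of the file's size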




Answer 2:


Spark treats S3 as if it were a block-based filesystem, so the partitioning rules for HDFS and S3 inputs are the same: by default you get one partition per block. It is worth inspecting the number of created partitions yourself:

val inputRDD = sc.textFile("s3a://...")
println(inputRDD.partitions.length)
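
If the default comes out too coarse, textFile() also accepts a minimum-partition hint as its second argument. A short sketch (the value 100 is arbitrary; Spark treats it as a lower bound, so the split computation may still produce more partitions):

// Ask for at least 100 input partitions; the actual count may be higher.
val hintedRDD = sc.textFile("s3a://...", minPartitions = 100)
println(hintedRDD.partitions.length)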

For further reading I suggest this article, which covers the partitioning rules in detail.



Source: https://stackoverflow.com/questions/37168716/how-many-partitions-does-spark-create-when-a-file-is-loaded-from-s3-bucket
