Spark Creates Fewer Partitions Than the minPartitions Argument on wholeTextFiles

徘徊边缘 submitted on 2019-12-06 13:46:59

It would be clearer if we knew the size of each file, but the code is not wrong. I am adding this answer based on the Spark code base.

  • First of all, maxSplitSize is calculated from the total size of the input directory and the minPartitions value passed to wholeTextFiles (a worked example follows after this list):

        def setMinPartitions(context: JobContext, minPartitions: Int) {
          val files = listStatus(context).asScala
          // Total size of all (non-directory) input files
          val totalLen = files.map(file => if (file.isDirectory) 0L else file.getLen).sum
          // Each split may hold at most ceil(totalLen / minPartitions) bytes
          val maxSplitSize = Math.ceil(totalLen * 1.0 /
            (if (minPartitions == 0) 1 else minPartitions)).toLong
          super.setMaxSplitSize(maxSplitSize)
        }
        // file: WholeTextFileInputFormat.scala
    


  • Based on maxSplitSize, splits (Spark partitions) are then extracted from the source:

        inputFormat.setMinPartitions(jobContext, minPartitions)
        // The number of splits (Spark partitions) is decided here
        val rawSplits = inputFormat.getSplits(jobContext).toArray
        val result = new Array[Partition](rawSplits.size)
        for (i <- 0 until rawSplits.size) {
          result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
        }
        // file: WholeTextFileRDD.scala
    

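To see why the actual number of partitions can be lower than minPartitions, here is a minimal standalone sketch (not Spark code, with hypothetical file sizes) that reproduces the arithmetic above. The greedy packing is a deliberate simplification of what CombineFileInputFormat#getSplits does: whole files are combined into splits of at most maxSplitSize bytes, and a single file is never divided across splits.

    // Standalone sketch: mimics setMinPartitions plus a simplified packing step.
    object SplitCountSketch {
      // Same arithmetic as WholeTextFileInputFormat.setMinPartitions
      def maxSplitSize(fileSizes: Seq[Long], minPartitions: Int): Long = {
        val totalLen = fileSizes.sum
        Math.ceil(totalLen * 1.0 / (if (minPartitions == 0) 1 else minPartitions)).toLong
      }

      // Greedily pack whole files into splits of at most maxSize bytes each
      def countSplits(fileSizes: Seq[Long], maxSize: Long): Int =
        fileSizes.foldLeft((0, 0L)) { case ((splits, current), len) =>
          if (current == 0L) (splits + 1, len)              // start the first split
          else if (current + len <= maxSize) (splits, current + len)
          else (splits + 1, len)                            // file does not fit, open a new split
        }._1

      def main(args: Array[String]): Unit = {
        // One 90-byte file plus ten 1-byte files (hypothetical sizes, 100 bytes total)
        val sizes = Seq(90L) ++ Seq.fill(10)(1L)
        val max = maxSplitSize(sizes, minPartitions = 10)   // ceil(100 / 10) = 10
        println(s"maxSplitSize = $max, splits = ${countSplits(sizes, max)}")
        // Prints: maxSplitSize = 10, splits = 2 -- only 2 partitions despite minPartitions = 10
      }
    }

With these sizes, maxSplitSize comes out to 10 bytes; the 90-byte file occupies one split on its own and the ten small files fit into a second one, so wholeTextFiles ends up with 2 partitions even though 10 were requested.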

More information on how the files are read and the splits are prepared is available in CombineFileInputFormat#getSplits.
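
For an end-to-end check, a minimal driver sketch follows; the directory path is hypothetical and the printed partition count depends entirely on the sizes of the files in it.

    import org.apache.spark.sql.SparkSession

    object WholeTextFilesPartitions {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("wholeTextFiles-partitions")
          .master("local[*]")
          .getOrCreate()

        // Ask for 10 partitions; the actual count can be lower, depending on
        // how the files pack into splits of at most maxSplitSize bytes.
        val rdd = spark.sparkContext.wholeTextFiles("/tmp/many-small-files", minPartitions = 10)
        println(s"requested minPartitions = 10, actual partitions = ${rdd.getNumPartitions}")

        spark.stop()
      }
    }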

Note:

I referred to Spark partitions as MapReduce splits here because Spark borrows its input and output formats from MapReduce.
