How does Spark SQL decide the number of partitions it will use when loading data from a Hive table?


This question is the same as Number of partitions of a spark dataframe created by reading the data from Hive table.

But I think that question did not get a correct answer.

1 Answer

    TL;DR: The default number of partitions when reading data from Hive is governed by the HDFS blockSize. The number of partitions can be increased by setting mapreduce.job.maps to an appropriate value, and decreased by setting mapreduce.input.fileinputformat.split.minsize to an appropriate value.
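
    As a minimal sketch of where those properties can be set (the table name and all values below are hypothetical), they go on the Hadoop configuration of the SparkContext before the table is read:

    // Sketch only: assumes a Spark 2.x+ SparkSession with Hive support;
    // "some_db.some_table" and the numbers used here are made up.
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HivePartitionDemo {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hive-partition-demo")
            .enableHiveSupport()
            .getOrCreate();

        // Increase partitions (old API): raise the guide number of map tasks.
        spark.sparkContext().hadoopConfiguration()
            .setInt("mapreduce.job.maps", 100);

        // Decrease partitions: raise the minimum split size (here 256 MB).
        spark.sparkContext().hadoopConfiguration()
            .setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);

        Dataset<Row> df = spark.table("some_db.some_table");
        System.out.println(df.rdd().getNumPartitions());  // partition count after splitting
      }
    }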

    Spark SQL creates an instance of HadoopRDD when loading data from a Hive table.

    An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the older MapReduce API (org.apache.hadoop.mapred).

    HadoopRDD in turn splits input files according to the computeSplitSize method defined in org.apache.hadoop.mapreduce.lib.input.FileInputFormat (the new API) and org.apache.hadoop.mapred.FileInputFormat (the old API).

    New API:

    protected long computeSplitSize(long blockSize, long minSize,
                                    long maxSize) {
      return Math.max(minSize, Math.min(maxSize, blockSize));
    }
    

    Old API:

    protected long computeSplitSize(long goalSize, long minSize,
                                    long blockSize) {
      return Math.max(minSize, Math.min(goalSize, blockSize));
    }
    

    computeSplitSize splits files according to the HDFS blockSize, but if the blockSize is less than minSize or greater than maxSize, it is clamped to those extremes. The HDFS blockSize can be obtained with:

    hdfs getconf -confKey dfs.blocksize
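
    To make the clamping concrete, here is a small worked example with assumed values: a 128 MB dfs.blocksize and the usual defaults of minSize = 1 and maxSize = Long.MAX_VALUE.

    public class SplitSizeExample {
      public static void main(String[] args) {
        // Illustrative values only; they are not read from any cluster.
        long blockSize = 128L * 1024 * 1024;  // dfs.blocksize reported by hdfs getconf
        long minSize   = 1L;                  // effective default of ...split.minsize
        long maxSize   = Long.MAX_VALUE;      // default of ...split.maxsize

        // New-API formula: clamp the block size between minSize and maxSize.
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        System.out.println(splitSize);  // 134217728 (128 MB) -> a 1 GB file yields 8 partitions
      }
    }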
    

    According to Hadoop: The Definitive Guide (Table 8.5), minSize is obtained from mapreduce.input.fileinputformat.split.minsize and maxSize is obtained from mapreduce.input.fileinputformat.split.maxsize.

    However, the book also notes the following about mapreduce.input.fileinputformat.split.maxsize:

    This property is not present in the old MapReduce API (with the exception of CombineFileInputFormat). Instead, it is calculated indirectly as the size of the total input for the job, divided by the guide number of map tasks specified by mapreduce.job.maps (or the setNumMapTasks() method on JobConf).

    This post also calculates the maxSize as the total input size divided by the number of map tasks.
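
    As a rough illustration of that indirect calculation (all numbers below are made up: a 10 GB input and mapreduce.job.maps = 100), the old-API formula then uses the resulting goal size in place of maxSize:

    public class GoalSizeExample {
      public static void main(String[] args) {
        // Illustrative values only.
        long totalInputSize = 10L * 1024 * 1024 * 1024;      // total bytes of the job's input
        int  numMapTasks    = 100;                           // mapreduce.job.maps
        long goalSize       = totalInputSize / numMapTasks;  // ~102 MB per split

        long blockSize = 128L * 1024 * 1024;                 // dfs.blocksize
        long minSize   = 1L;                                 // minimum split size (default 1)

        // Old-API formula: goalSize plays the role of maxSize, so a larger
        // mapreduce.job.maps pushes the split size below the block size.
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
        System.out.println(splitSize);  // 107374182 bytes -> roughly 100 partitions
      }
    }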
