How many partitions Spark creates when loading a Hive table


Question


Whether the source is a Hive table or an HDFS file, I had assumed that when Spark reads the data and creates a DataFrame, the number of partitions in the RDD/DataFrame would equal the number of part-files in HDFS. But when I tested this with a Hive external table, the number came out different from the number of part-files: the DataFrame had 119 partitions, while the table was a Hive partitioned table with 150 part-files, the smallest being 30 MB and the largest 118 MB. So what actually decides the number of partitions?


Answer 1:


You can control how many bytes Spark packs into a single partition by setting spark.sql.files.maxPartitionBytes. The default value is 128 MB; see the Spark performance tuning guide.
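As a minimal sketch (Scala, spark-shell), assuming an existing SparkSession named spark and a placeholder Hive table db.my_table, you could lower this setting and watch the DataFrame's partition count change:

```scala
// Hedged sketch: `spark` is the usual spark-shell SparkSession,
// and `db.my_table` is a placeholder table name.
// Lower the per-partition byte target from the 128 MB default to 64 MB.
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024) // value in bytes

val df = spark.table("db.my_table")
// Roughly ceil(total input bytes / maxPartitionBytes), subject to file
// boundaries and spark.sql.files.openCostInBytes.
println(df.rdd.getNumPartitions)
```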




Answer 2:


I think this link answers my question: the number of partitions depends on the number of input splits, and the splits depend on the Hadoop InputFormat. https://intellipaat.com/community/7671/how-does-spark-partition-ing-work-on-files-in-hdfs A small sketch of that idea follows below.
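For RDD-style reads the same idea is visible directly: the partition count mirrors the Hadoop input splits. An illustrative sketch, assuming a SparkContext named sc and a hypothetical HDFS path:

```scala
// Sketch only: the path is a placeholder. TextInputFormat computes the splits
// (roughly one per HDFS block), and Spark creates one partition per split.
val rdd = sc.textFile("hdfs:///data/my_table/")
println(rdd.getNumPartitions)
```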




Answer 3:


Spark reads the data in chunks of 128 MB. If your Hive table is approximately 14.8 GB in total, it gets divided into 128 MB chunks, which results in 119 partitions.

On the other hand, your Hive table is partitioned, and the partition column has 150 unique values, which is where the 150 part-files come from.

So the number of part-files in Hive and the number of partitions in Spark are not directly linked.
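A back-of-the-envelope check of the 119 figure, assuming roughly 14.8 GB of table data and the 128 MB default for spark.sql.files.maxPartitionBytes:

```scala
// Rough arithmetic only; real partition counts also depend on file boundaries.
val totalBytes  = 14.8 * 1024 * 1024 * 1024 // ~15.9e9 bytes
val targetBytes = 128 * 1024 * 1024         // 128 MB
println(math.ceil(totalBytes / targetBytes)) // 119.0
```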



Source: https://stackoverflow.com/questions/60991846/how-many-partitions-spark-creates-when-loading-a-hive-table
