How many partitions Spark creates when loading a Hive table


Question


Whether the source is a Hive table or an HDFS file, I had assumed that when Spark reads the data and creates a DataFrame, the number of partitions in the RDD/DataFrame would equal the number of part-files in HDFS. But when I tested this with a Hive external table, the number came out different from the number of part-files: the DataFrame had 119 partitions, while the table was a Hive partitioned table with 150 part-files, the smallest being 30 MB and the largest 118 MB. So what actually decides the number of partitions?


Answer 1:


You can control how many bytes Spark packs into a single partition by setting spark.sql.files.maxPartitionBytes. The default value is 128 MB; see the Spark performance tuning guide.
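As a minimal sketch (Scala, spark-shell), assuming an existing SparkSession named spark and a placeholder Hive table db.my_table, you could lower this setting and watch the DataFrame's partition count change:

```scala
// Hedged sketch: `spark` is the usual spark-shell SparkSession,
// and `db.my_table` is a placeholder table name.
// Lower the per-partition byte target from the 128 MB default to 64 MB.
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024) // value in bytes

val df = spark.table("db.my_table")
// Roughly ceil(total input bytes / maxPartitionBytes), subject to file
// boundaries and spark.sql.files.openCostInBytes.
println(df.rdd.getNumPartitions)
```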




Answer 2:


I think this link answers my question: the number of partitions depends on the number of input splits, and the splits depend on the Hadoop InputFormat. https://intellipaat.com/community/7671/how-does-spark-partition-ing-work-on-files-in-hdfs A small sketch of that idea follows below.
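For RDD-style reads the same idea is visible directly: the partition count mirrors the Hadoop input splits. An illustrative sketch, assuming a SparkContext named sc and a hypothetical HDFS path:

```scala
// Sketch only: the path is a placeholder. TextInputFormat computes the splits
// (roughly one per HDFS block), and Spark creates one partition per split.
val rdd = sc.textFile("hdfs:///data/my_table/")
println(rdd.getNumPartitions)
```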




Answer 3:


Spark reads the data in chunks of 128 MB. If your Hive table is approximately 14.8 GB in total, it gets divided into 128 MB chunks, which results in 119 partitions.

On the other hand, your Hive table is partitioned, and the partition column has 150 unique values, which is where the 150 part-files come from.

So the number of part-files in Hive and the number of partitions in Spark are not directly linked.
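A back-of-the-envelope check of the 119 figure, assuming roughly 14.8 GB of table data and the 128 MB default for spark.sql.files.maxPartitionBytes:

```scala
// Rough arithmetic only; real partition counts also depend on file boundaries.
val totalBytes  = 14.8 * 1024 * 1024 * 1024 // ~15.9e9 bytes
val targetBytes = 128 * 1024 * 1024         // 128 MB
println(math.ceil(totalBytes / targetBytes)) // 119.0
```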



Source: https://stackoverflow.com/questions/60991846/how-many-partitions-spark-creates-when-loading-a-hive-table
