Spark + Hive: Number of partitions scanned exceeds limit (=4000)


Question


We upgraded our Hadoop platform (Spark: 2.3.0, Hive: 3.1), and I'm now facing this exception when reading some Hive tables in Spark: "Number of partitions scanned on table 'my_table' exceeds limit (=4000)".

Tables we are working on (a simplified DDL sketch follows the list):
table1: external table with ~12300 partitions in total, partitioned by (col1: String, date1: String), stored as ORC compressed with ZLIB
table2: external table with 4585 partitions in total, partitioned by (col21: String, date2: Date, col22: String), stored as ORC, uncompressed
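
For reference, a minimal sketch of what table2's DDL might look like, just to highlight the Date-typed partition column. The data column "value" and the LOCATION are placeholders, not our real schema:

// Hypothetical, simplified DDL for table2 ("value" and the location are placeholders)
spark.sql("""
  CREATE EXTERNAL TABLE table2 (value STRING)
  PARTITIONED BY (col21 STRING, date2 DATE, col22 STRING)
  STORED AS ORC
  LOCATION '/path/to/table2'
""")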

[A] Knowing that we had set this Spark conf:
--conf "spark.hadoop.metastore.catalog.default=hive"
We execute in Spark:
[1] spark.sql("select * from table1 where col1 = 'value1' and date1 = '2020-06-03'").count
=> Error: Number of partitions scanned (=12300) on table 'table1' exceeds limit (=4000)
[2] spark.sql("select * from table2 where col21 = 'value21' and col22 = 'value22'").count
[3] spark.sql("select * from table2 where col21 = 'value21' and date2 = '2020-06-03' and col22 = 'value22'").count
=> Error on both [2] and [3]: Number of partitions scanned (=4585) on table 'table2' exceeds limit (=4000)
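
For context, the limit in the error message is the Hive setting hive.metastore.limit.partition.request (introduced by HIVE-13884), which rejects requests that would fetch too many partitions from the metastore. As a sketch (not something we have validated), it can in principle be raised from the Spark side, though on a remote metastore the setting is enforced server-side and a client override may be ignored:

// Sketch: raising the metastore partition-request limit from the client side.
// On a remote Hive metastore this may have to be changed on the server instead.
spark-shell --conf "spark.hadoop.hive.metastore.limit.partition.request=20000"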

[B] We partially solved the problem by adding this Spark conf:

--conf "spark.sql.hive.convertMetastoreOrc=false"

which results in metastore partition pruning effectively being activated: --conf "spark.sql.hive.metastorePartitionPruning=true"

Re-executing in Spark:
[1] and [2] => Success
[3] => Error: Number of partitions scanned (=4585) on table 'table2' exceeds limit (=4000)
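
The same settings can also be applied when building the session rather than on the command line. A minimal sketch, assuming a plain SparkSession with Hive support (the app name is arbitrary):

// Sketch: session-level equivalent of the --conf flags above
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-pruning-test") // arbitrary name
  .config("spark.hadoop.metastore.catalog.default", "hive")
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .enableHiveSupport()
  .getOrCreate()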

[C] To solve the error on [3], we set:

--conf "spark.sql.hive.convertMetastoreOrc=false"
--conf "spark.sql.hive.metastorePartitionPruning=false"

Re-executing in Spark:
[3] => Success
On the other hand, if we re-run [1], performance is degraded: it takes far too long to execute, and we don't want that.
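
A middle ground we could imagine (a sketch, not yet validated on our platform): since spark.sql.hive.metastorePartitionPruning is a runtime SQL conf, disable it only around query [3] and keep pruning active for [1]:

// Sketch: scope the pruning setting to the Date-partition query only
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "false")
spark.sql("select * from table2 where col21 = 'value21' and date2 = '2020-06-03' and col22 = 'value22'").count
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")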

In conclusion:
In case [B], it seems that a partition column of type Date cannot be pruned, whereas a String partition column works fine.
But why? What's going on? Aren't partition columns of types other than String supposed to work when partition pruning is activated?
Why does it work in case [C]? And how could we solve case [B][3] without degrading the performance of [1]?

I hope that's clear; please let me know if you need more information!

Thank you for any help or advice!

Source: https://stackoverflow.com/questions/62180078/spark-hive-number-of-partitions-scanned-exceeds-limit-4000
