spark.sql.hive.filesourcePartitionFileCacheSize


Question


Just wondering if anyone is aware of this warning:

18/01/10 19:52:56 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance

I've seen this a lot when trying to load a big DataFrame with many partitions from S3 into Spark.

It never really causes any issues for the job; I just wonder what that config property is used for and how to tune it properly.

Thanks


Answer 1:


In answer to your question, this is a Spark-Hive specific config property which, when nonzero, enables caching of partition file metadata in memory. All tables share a cache that can use up to the specified number of bytes for file metadata. This conf only has an effect when Hive filesource partition management is enabled.

In the Spark source code it is defined as follows. The default size is 250 * 1024 * 1024 bytes (250 MB), which you can override through the SparkConf object in your code or via the spark-submit command.

Spark Source Code

val HIVE_FILESOURCE_PARTITION_FILE_CACHE_SIZE =
    buildConf("spark.sql.hive.filesourcePartitionFileCacheSize")
      .doc("When nonzero, enable caching of partition file metadata in memory. All tables share " +
           "a cache that can use up to specified num bytes for file metadata. This conf only " +
           "has an effect when hive filesource partition management is enabled.")
      .longConf
      .createWithDefault(250 * 1024 * 1024)
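For reference, here is a minimal sketch (not from the original answer) of how the property might be raised when building a SparkSession; the application name and the 500 MB value are illustrative assumptions, not recommendations.

import org.apache.spark.sql.SparkSession

// Illustrative sketch: raise the partition file metadata cache to ~500 MB.
// The app name and the 500 MB figure are assumptions for demonstration only.
val spark = SparkSession.builder()
  .appName("PartitionFileCacheExample")
  .config("spark.sql.hive.filesourcePartitionFileCacheSize", 500L * 1024 * 1024)
  .enableHiveSupport()
  .getOrCreate()

// The same value can be passed on the command line, e.g.:
//   spark-submit --conf spark.sql.hive.filesourcePartitionFileCacheSize=524288000 ...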


Source: https://stackoverflow.com/questions/48195147/spark-sql-hive-filesourcepartitionfilecachesize
