Question
I'm just wondering if anyone is aware of this warning:
18/01/10 19:52:56 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance
I've seen this a lot when loading a large DataFrame with many partitions from S3 into Spark.
It never really causes any issues for the job; I'm just wondering what that config property is for and how to tune it properly.
Thanks
Answer 1:
In answer to your question, this is a Spark-Hive specific config property which, when nonzero, enables caching of partition file metadata in memory. All tables share a cache that can use up to the specified number of bytes for file metadata. This conf only has an effect when Hive filesource partition management is enabled.
In the Spark source code it is defined as follows. The default size is 250 * 1024 * 1024 bytes (250 MB), which you can override via the SparkConf/SparkSession builder in your code or with --conf on the spark-submit command line (see the sketch after the source excerpt below).
Spark Source Code
val HIVE_FILESOURCE_PARTITION_FILE_CACHE_SIZE =
  buildConf("spark.sql.hive.filesourcePartitionFileCacheSize")
    .doc("When nonzero, enable caching of partition file metadata in memory. All tables share " +
      "a cache that can use up to specified num bytes for file metadata. This conf only " +
      "has an effect when hive filesource partition management is enabled.")
    .longConf
    .createWithDefault(250 * 1024 * 1024)
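If the eviction warning shows up often and query planning feels slow, you can raise the cache budget. Below is a minimal sketch of setting it from code, assuming an illustrative value of 1 GB (1073741824 bytes) and a hypothetical table name my_partitioned_table; pick a value that fits your driver memory.

import org.apache.spark.sql.SparkSession

object PartitionCacheTuning {
  def main(args: Array[String]): Unit = {
    // Raise the partition file metadata cache from the 250 MB default to 1 GB.
    // The value is illustrative; size it according to your driver heap.
    val spark = SparkSession.builder()
      .appName("partition-cache-tuning")
      .config("spark.sql.hive.filesourcePartitionFileCacheSize", 1073741824L)
      .enableHiveSupport()
      .getOrCreate()

    // Reading a heavily partitioned table now has more room to keep
    // partition file metadata cached before SharedInMemoryCache evicts it.
    val df = spark.table("my_partitioned_table") // hypothetical table name
    df.show()
  }
}

Equivalently, the same setting can be passed on the command line:

spark-submit --conf spark.sql.hive.filesourcePartitionFileCacheSize=1073741824 ...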
Source: https://stackoverflow.com/questions/48195147/spark-sql-hive-filesourcepartitionfilecachesize