Does Spark support Partition Pruning with Parquet Files


Yes, Spark supports partition pruning.

Spark lists the partition directories (sequentially, or in parallel via listLeafFilesInParallel) to build a cache of all partitions the first time around. Queries in the same application that scan the data take advantage of this cache, so the slowness you see could be due to this initial cache building. Subsequent queries that scan the data use the cache to prune partitions.
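For illustration, here is a minimal sketch of a read that benefits from pruning; the bucket, path, and the "month" partition column follow the log lines further down and are otherwise hypothetical, and the SQLContext API matches the Spark 1.x versions those logs come from:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object PartitionPruningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-pruning-sketch"))
        val sqlContext = new SQLContext(sc)

        // Hive-style layout: .../test_parquet_pruning/month=2015-01/part-*.parquet
        val df = sqlContext.read.parquet("s3://test-bucket/test_parquet_pruning")

        // Filtering on the partition column ("month") lets Spark prune directories,
        // so only month=2015-01 should actually be scanned.
        df.filter(df("month") === "2015-01").show()

        sc.stop()
      }
    }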

These are the logs which show partitions being listed to populate the cache.

App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-01 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-02 on driver
App > 16/11/14 10:45:24 main INFO ParquetRelation: Listing s3://test-bucket/test_parquet_pruning/month=2015-03 on driver

These are the logs showing that pruning is happening.

App > 16/11/10 12:29:16 main INFO DataSourceStrategy: Selected 1 partitions out of 20, pruned 95.0% partitions.
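As a sanity check in your own application, you can also print the query plan for a filter on the partition column (continuing the hypothetical sketch above); the exact plan output differs between Spark versions, and on 1.x the DataSourceStrategy log line above remains the clearest indicator:

    // Print the extended plan; with pruning in effect only the selected
    // partition directories should be read when the query runs.
    df.filter(df("month") === "2015-01").explain(true)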

Refer to convertToParquetRelation and getHiveQlPartitions in HiveMetastoreCatalog.scala.

Marco99

Just a thought:

The Spark API documentation for HadoopFsRelation ( https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/sources/HadoopFsRelation.html ) says:

"...when reading from Hive style partitioned tables stored in file systems, it's able to discover partitioning information from the paths of input directories, and perform partition pruning before start reading the data..."

So I guess listLeafFilesInParallel should not be the problem.

A similar issue is already tracked in the Spark JIRA: https://issues.apache.org/jira/browse/SPARK-10673

Since setting "spark.sql.hive.verifyPartitionPath" to false had no effect on performance, I suspect that the issue might have been caused by unregistered partitions. Please list the partitions of the table and verify that all of them are registered. Otherwise, recover your partitions as shown in this link:

Hive doesn't read partitioned parquet files generated by Spark
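For reference, a minimal sketch (e.g. in spark-shell) of checking and recovering metastore partitions from Spark, assuming a hypothetical Hive table named test_parquet_pruning; MSCK REPAIR TABLE is the Hive command that registers partition directories present on the file system but missing from the metastore:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)  // sc: the existing SparkContext

    // List what the metastore currently knows about.
    hiveContext.sql("SHOW PARTITIONS test_parquet_pruning").show()

    // Register partition directories present on the file system but
    // missing from the metastore.
    hiveContext.sql("MSCK REPAIR TABLE test_parquet_pruning")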

Update:

  1. I assume appropriate Parquet block size and page size were set while writing the data.

  2. Create a fresh Hive table with the partitions specified and the file format set to Parquet, then load it from the non-partitioned table using the dynamic partitioning approach ( https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions ). Run a plain Hive query and then compare by running a Spark program; a sketch of the dynamic-partition load follows below.
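A hedged sketch of step 2, with hypothetical table and column names (sales_raw, sales_partitioned, month) and reusing the HiveContext from the snippet above; the two SET commands enable Hive's dynamic partitioning before the INSERT:

    // Allow dynamic partitioning for the INSERT ... PARTITION statement.
    hiveContext.sql("SET hive.exec.dynamic.partition=true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Fresh partitioned table stored as Parquet.
    hiveContext.sql("""
      CREATE TABLE sales_partitioned (id INT, amount DOUBLE)
      PARTITIONED BY (month STRING)
      STORED AS PARQUET
    """)

    // Load from the non-partitioned table; the last SELECT column ("month")
    // feeds the dynamic partition column.
    hiveContext.sql("""
      INSERT OVERWRITE TABLE sales_partitioned PARTITION (month)
      SELECT id, amount, month FROM sales_raw
    """)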

Disclaimer: I am not a Spark/Parquet expert. The problem sounded interesting, hence the response.
