How do I enable partition pruning in spark

假如想象 提交于 2019-12-05 02:19:41

Spark needs to load the partition metdata first in the driver to know whether the partition exists or not. Spark will query the directory to find existing partitions to know if it can prune the partition or not during the scanning of the data.

I've tested this on Spark 2.0 and you can see in the log messages.

16/10/14 17:23:37 TRACE ListingFileCatalog: Listing s3a://mybucket/reddit_year on driver
16/10/14 17:23:37 TRACE ListingFileCatalog: Listing s3a://mybucket/reddit_year/year=2007 on driver

This doesn't mean that we're scaning the files in each partition, but Spark will store the locations of the partitions for future queries on the table.

You can see the logs that it is actually passing in partition filters to prune the data:

16/10/14 17:23:48 TRACE ListingFileCatalog: Partition spec: PartitionSpec(StructType(StructField(year,IntegerType,true)),ArrayBuffer(PartitionDirectory([2012],s3a://mybucket/reddit_year/year=2012), PartitionDirectory([2010],s3a://mybucket/reddit_year/year=2010), ...PartitionDirectory([2015],s3a://mybucket/reddit_year/year=2015), PartitionDirectory([2011],s3a://mybucket/reddit_year/year=2011)))
16/10/14 17:23:48 INFO ListingFileCatalog: Selected 1 partitions out of 9, pruned 88.88888888888889% partitions.

You can see this in the logical plan if you run an explain(True) on your query: spark.sql("select created_utc, score, name from reddit where year = '2014'").explain(True)

This will show you the plan and you can see that it is filtering at the bottom of the plan:

+- *BatchedScan parquet [created_utc#58,name#65,score#69L,year#74] Format: ParquetFormat, InputPaths: s3a://mybucket/reddit_year, PartitionFilters: [isnotnull(year#74), (cast(year#74 as double) = 2014.0)], PushedFilters: [], ReadSchema: struct<created_utc:string,name:string,score:bigint>

Spark has opportunities to improve its partition pruning when going via Hive; see SPARK-17179.

If you are just going direct to the object store, then the problem is that recursive directory operations against object stores are real performance killers. My colleagues and I have done work in the S3A client there HADOOP-11694 —and now need to follow it up with the changes to Spark to adopt the specific API calls we've been able to fix. For that though we need to make sure we are working with real datasets with real-word layouts, so don't optimise for specific examples/benchmarks.

For now, people should chose partition layouts which have shallow directory trees.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!