Partitions not being pruned in simple SparkSQL queries

Asked by anonymous (unverified) on 2019-12-03 08:44:33

Question:

I'm trying to efficiently select individual partitions from a SparkSQL table (parquet in S3). However, I see evidence of Spark opening all parquet files in the table, not just those that pass the filter. This makes even small queries expensive for tables with large numbers of partitions.

Here's an illustrative example. I created a simple partitioned table on S3 using SparkSQL and a Hive metastore:

import pandas

# Make some data
df = pandas.DataFrame({'pk': ['a']*5 + ['b']*5 + ['c']*5,
                       'k': ['a', 'e', 'i', 'o', 'u']*3,
                       'v': range(15)})

# Convert to a SparkSQL DataFrame (hiveContext is the HiveContext
# available in the PySpark shell)
sdf = hiveContext.createDataFrame(df)

# And save it as a partitioned parquet table
sdf.write.partitionBy('pk').saveAsTable('dataset',
                                        format='parquet',
                                        path='s3a://bucket/dataset')

In a subsequent session, I want to select a subset of this table:

dataset = hiveContext.table('dataset')
filtered_dataset = dataset.filter(dataset.pk == 'b')
print filtered_dataset.toPandas()

In the logs that follow, I can see that partition pruning is supposedly happening:

15/07/05 02:39:39 INFO DataSourceStrategy: Selected 1 partitions out of 3, pruned -200.0% partitions.

But then I see parquet files being opened from all partitions:

15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset 508
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 508
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset 262
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 262
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 152
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 151
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset -266
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 4
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 152

With only three partitions this isn't a problem, but with thousands it causes noticeable delay. Why are all these irrelevant files being opened?

Answer 1:

Take a look at spark.sql.parquet.filterPushdown. It is set to false by default because of bugs in the Parquet version that Spark uses. It may be possible to use it in 1.3/1.4; check the official documentation.
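For example, here is a minimal sketch of turning the setting on before running the query (assuming the same hiveContext and table from the question; whether it actually avoids the extra reads depends on your Spark and Parquet versions):

# Enable Parquet filter pushdown for this session
hiveContext.setConf('spark.sql.parquet.filterPushdown', 'true')

dataset = hiveContext.table('dataset')
filtered_dataset = dataset.filter(dataset.pk == 'b')
print filtered_dataset.toPandas()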

I think this is fixed in Spark 1.5.
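Either way, a quick sanity check in any version is to print the physical plan before collecting anything and look at what the scan is going to read (a sketch, reusing filtered_dataset from the question; the exact output format varies across Spark versions):

# Print the physical plan; a pruned scan should only reference the pk=b partition
filtered_dataset.explain()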


