Does Spark support true column scans over parquet files in S3?

Asked 2020-12-13 06:45

One of the great benefits of the Parquet data storage format is that it's columnar. If I've got a 'wide' dataset with hundreds of columns, but my query only touches a few of them, does Spark read only the data for those columns, or does it end up scanning (and transferring) the whole file from S3?
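
For concreteness, this is the kind of access pattern I have in mind (the bucket path and column names are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("wide-parquet-read").getOrCreate()

    // A wide Parquet dataset in S3, but the query projects only two columns.
    val df = spark.read.parquet("s3a://my-bucket/wide-dataset/")
    df.select("col_a", "col_b").show()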

4 Answers

    Answered 2020-12-13 07:21

    No, predicate pushdown is not fully supported. This, of course, depends on:

    • Specific use case
    • Spark version
    • S3 connector type and version

    In order to check your specific use case, you can enable DEBUG log level in Spark and run your query. Then you can see whether there are "seeks" (ranged reads) in the S3 (HTTP) requests, as well as how many requests were actually sent. Something like this:

        17/06/13 05:46:50 DEBUG wire: http-outgoing-1 >> "GET /test/part-00000-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet HTTP/1.1[\r][\n]"
        ....
        17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Range: bytes 0-7472093/7472094[\r][\n]"
        ....
        17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Length: 7472094[\r][\n]"

    Note that in this trace the Content-Range and Content-Length cover the entire 7,472,094-byte object, i.e. the whole Parquet file was transferred rather than just the column chunks and footer the query actually needed.
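
    To reproduce this check yourself, here is a minimal sketch (the bucket path and column name are hypothetical). It assumes the s3a connector and that wire-level logging for the Apache HttpClient is enabled, for example by adding log4j.logger.org.apache.http.wire=DEBUG to the log4j.properties on the driver's classpath (the exact logger category can vary with your Hadoop/S3A version):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("parquet-column-scan-check").getOrCreate()

        // Hypothetical wide Parquet dataset in S3, read through the s3a connector.
        val df = spark.read.parquet("s3a://my-bucket/wide-dataset/")

        // Touch a single column only. If column pruning works end to end, the wire
        // log should show ranged GETs covering just that column's chunks and the
        // footer, not the whole object.
        df.select("one_column").where("one_column IS NOT NULL").count()

    If the log shows a single GET whose Content-Length equals the full file size, as in the trace above, the whole object was downloaded regardless of the projection.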

    Here's an example of an issue report opened recently about Spark 2.1 being unable to compute COUNT(*) over all the rows in a dataset from the metadata stored in the Parquet footers: https://issues.apache.org/jira/browse/SPARK-21074
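
    For reference, the query shape that ticket describes is simply a plain row count (path hypothetical), which could in principle be answered from the row counts in the Parquet footers alone:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("count-from-metadata").getOrCreate()

        // A plain count over Parquet files; per SPARK-21074, affected Spark
        // versions still read the files fully instead of using footer metadata.
        spark.read.parquet("s3a://my-bucket/wide-dataset/").count()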
