Does Spark support true column scans over parquet files in S3?

Asked 2020-12-13 06:45

One of the great benefits of the Parquet data storage format is that it's columnar. If I've got a 'wide' dataset with hundreds of columns, but my query only touches a few of them, does Spark read only the data for those columns, or does it end up scanning (and transferring) the whole file from S3?
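
For concreteness, this is the kind of access pattern I have in mind (the bucket path and column names are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("wide-parquet-read").getOrCreate()

    // A wide Parquet dataset in S3, but the query projects only two columns.
    val df = spark.read.parquet("s3a://my-bucket/wide-dataset/")
    df.select("col_a", "col_b").show()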

4 Answers

    Answered 2020-12-13 07:21

    No, predicate pushdown is not fully supported. This, of course, depends on:

    • Specific use case
    • Spark version
    • S3 connector type and version

    In order to check your specific use case, you can enable DEBUG log level in Spark and run your query. Then you can see whether there are "seeks" (ranged reads) in the S3 (HTTP) requests, as well as how many requests were actually sent. Something like this:

        17/06/13 05:46:50 DEBUG wire: http-outgoing-1 >> "GET /test/part-00000-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet HTTP/1.1[\r][\n]"
        ....
        17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Range: bytes 0-7472093/7472094[\r][\n]"
        ....
        17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Length: 7472094[\r][\n]"

    Note that in this trace the Content-Range and Content-Length cover the entire 7,472,094-byte object, i.e. the whole Parquet file was transferred rather than just the column chunks and footer the query actually needed.
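
    To reproduce this check yourself, here is a minimal sketch (the bucket path and column name are hypothetical). It assumes the s3a connector and that wire-level logging for the Apache HttpClient is enabled, for example by adding log4j.logger.org.apache.http.wire=DEBUG to the log4j.properties on the driver's classpath (the exact logger category can vary with your Hadoop/S3A version):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("parquet-column-scan-check").getOrCreate()

        // Hypothetical wide Parquet dataset in S3, read through the s3a connector.
        val df = spark.read.parquet("s3a://my-bucket/wide-dataset/")

        // Touch a single column only. If column pruning works end to end, the wire
        // log should show ranged GETs covering just that column's chunks and the
        // footer, not the whole object.
        df.select("one_column").where("one_column IS NOT NULL").count()

    If the log shows a single GET whose Content-Length equals the full file size, as in the trace above, the whole object was downloaded regardless of the projection.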

    Here's an example of an issue report opened recently about Spark 2.1 being unable to compute COUNT(*) over all the rows in a dataset from the metadata stored in the Parquet footers: https://issues.apache.org/jira/browse/SPARK-21074
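
    For reference, the query shape that ticket describes is simply a plain row count (path hypothetical), which could in principle be answered from the row counts in the Parquet footers alone:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("count-from-metadata").getOrCreate()

        // A plain count over Parquet files; per SPARK-21074, affected Spark
        // versions still read the files fully instead of using footer metadata.
        spark.read.parquet("s3a://my-bucket/wide-dataset/").count()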
