How to split parquet files into many partitions in Spark?

萌比男神i 2020-12-06 05:10

So I have just 1 parquet file I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100, but that doesn't change the number of partitions the file is read into.
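(A minimal sketch of that setup, just to make it concrete; the path and app name are placeholders:)

    import org.apache.spark.sql.SparkSession

    object ReadSingleParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("read-single-parquet")              // placeholder app name
          .config("spark.default.parallelism", "100")  // the setting I tried
          .getOrCreate()

        // "data/one-big-file.parquet" stands in for the single file.
        val df = spark.read.parquet("data/one-big-file.parquet")

        // Still reports far fewer than the desired 100 partitions for the read.
        println(df.rdd.getNumPartitions)

        spark.stop()
      }
    }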

5 answers
  •  情书的邮戳
    2020-12-06 05:45

    You should write your parquet files with a smaller block size. The default is 128 MB per block, but it's configurable by setting parquet.block.size in the writer's configuration.

    The source of ParquetOutputFormat has the details, if you want to dig into them.

    The block size is the minimum amount of logically readable data you can pull out of a parquet file (since parquet is columnar, you can't just split it by line or anything that trivial), so you can't have more reading threads than input blocks.
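    A rough sketch of rewriting the file with a smaller block size (paths are placeholders, and 16 MB is an arbitrary example value; one way to pass parquet.block.size to the Parquet writer is through the Hadoop configuration):

        import org.apache.spark.sql.SparkSession

        object RewriteWithSmallerBlocks {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("rewrite-small-blocks")
              .getOrCreate()

            // parquet.block.size is in bytes; 16 MB instead of the default 128 MB
            // means roughly 8x more blocks for the same data, hence more read splits.
            spark.sparkContext.hadoopConfiguration
              .setInt("parquet.block.size", 16 * 1024 * 1024)

            // Read the existing single file and write it back out so the new
            // copy is made of many smaller blocks.
            val df = spark.read.parquet("data/one-big-file.parquet")
            df.write.parquet("data/rewritten.parquet")

            spark.stop()
          }
        }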
