How to split parquet files into many partitions in Spark?

萌比男神i 2020-12-06 05:10

So I have just 1 parquet file I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism.

5 answers
  • 2020-12-06 05:40

    Maybe your parquet file only takes up one HDFS block. Create a big parquet file that spans many HDFS blocks and load it:

    val k = sqlContext.parquetFile("the-big-table.parquet")  // parquetFile lives on SQLContext, not SparkContext, in Spark 1.x
    k.partitions.length
    

    You'll see the same number of partitions as HDFS blocks. This worked fine for me (Spark 1.1.0).

  • 2020-12-06 05:45

    You should write your parquet files with a smaller block size. The default is 128 MB per block, but it's configurable by setting the parquet.block.size property in the writer.

    The source of ParquetOutputFormat is worth a look if you want to dig into the details.

    The block size is the minimum amount of data you can read out of a parquet file that is logically readable (since parquet is columnar, you can't just split it by line or anything trivial like that), so you can't have more reading threads than input blocks.
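
    As a hedged sketch of the write side, assuming a Spark 2.x SparkSession named spark and placeholder paths (as far as I know, write options are passed through to the Hadoop configuration the Parquet writer uses):

    // Sketch only: the paths are placeholders, and 64 MB is an arbitrary example value.
    val df = spark.read.parquet("/path/to/input")

    df.write
      .option("parquet.block.size", 64L * 1024 * 1024)  // 64 MB row groups instead of the 128 MB default
      .parquet("/path/to/output-smaller-blocks")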

  • 2020-12-06 05:46

    You have mentioned that you want to control the distribution while writing to parquet. When you create parquet from an RDD, parquet preserves the partitioning of the RDD. So if you create an RDD with 100 partitions and write it out in parquet format, it will write 100 separate parquet files to the filesystem. For reads, you can set the spark.sql.shuffle.partitions parameter, as sketched below.
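
    A minimal sketch of that idea, assuming a Spark 2.x SparkSession named spark; the paths and the count of 100 are placeholders:

    // Repartition before writing: this produces 100 separate parquet part-files on the filesystem.
    val df = spark.read.parquet("/path/to/one-file.parquet")
    df.repartition(100).write.parquet("/path/to/out")

    // For operations that shuffle after the read (joins, aggregations), this controls the resulting partition count.
    spark.conf.set("spark.sql.shuffle.partitions", "100")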

  • 2020-12-06 05:57

    To achieve that, you should use the SparkContext to set the Hadoop configuration property mapreduce.input.fileinputformat.split.maxsize (via sc.hadoopConfiguration).

    By setting this property to a value lower than hdfs.blockSize, you will get as many partitions as there are splits.

    For example:
    When hdfs.blockSize = 134217728 (128 MB),
    and one file is read which contains exactly one full block,
    and mapreduce.input.fileinputformat.split.maxsize = 67108864 (64 MB),

    then the file is split in two, and those splits are read into two partitions.
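
    A sketch of this approach, assuming the pre-2.x APIs the answer implies (sc is the SparkContext, sqlContext the SQLContext); the file name is a placeholder:

    // 67108864 bytes = 64 MB, half of the default 128 MB HDFS block size.
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "67108864")

    val df = sqlContext.read.parquet("the-big-table.parquet")
    println(df.rdd.partitions.length)  // expect 2 partitions for a file that fills exactly one 128 MB block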

  • 2020-12-06 05:58

    The new way of doing it (Spark 2.x) is setting

    spark.sql.files.maxPartitionBytes
    

    Source: https://issues.apache.org/jira/browse/SPARK-17998 (the official documentation is not correct yet; it misses the .sql).

    In my experience, the Hadoop settings no longer have any effect.
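
    A minimal sketch for Spark 2.x, assuming a SparkSession named spark; the 16 MB value and file name are placeholders:

    // Cap each read partition at 16 MB instead of the 128 MB default,
    // so a single ~128 MB parquet file is read as roughly 8 partitions.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 16L * 1024 * 1024)

    val df = spark.read.parquet("the-big-table.parquet")
    println(df.rdd.getNumPartitions)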
