How to split parquet files into many partitions in Spark?

萌比男神i 2020-12-06 05:10

So I have just 1 parquet file I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism, but the read still produces far fewer partitions than that.
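
For context, a minimal sketch of the setup being described, assuming a SparkSession-based job (the path and app name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-partitions") // illustrative name
  .config("spark.default.parallelism", "100") // the setting mentioned above
  .getOrCreate()

// Spark derives the number of read partitions from the file's splits,
// so a single small parquet file often yields very few partitions.
val df = spark.read.parquet("/path/to/single-file.parquet")
println(df.rdd.getNumPartitions)
```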

5 Answers
  •  攒了一身酷
    2020-12-06 05:46

    You have mentioned that you want to control the distribution while writing to parquet. When you create parquet files from an RDD, parquet preserves the RDD's partitioning: if you build an RDD with 100 partitions and write it out as a DataFrame in parquet format, Spark writes 100 separate parquet files to the filesystem. For the read side you can set the spark.sql.shuffle.partitions parameter.
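
    A minimal sketch of this approach, assuming Spark's Scala API (the data, paths, and counts here are illustrative):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("write-100-parts").getOrCreate()
    import spark.implicits._

    // An RDD with 100 partitions: the parquet writer emits one part file
    // per partition, so this write produces 100 files on the filesystem.
    val rdd = spark.sparkContext.parallelize(1 to 1000000, 100)
    rdd.toDF("value").write.parquet("/tmp/hundred_parts.parquet")

    // On read, spark.sql.shuffle.partitions controls how many partitions
    // result after any shuffle (join/groupBy) on the loaded DataFrame.
    spark.conf.set("spark.sql.shuffle.partitions", "100")
    val df = spark.read.parquet("/tmp/hundred_parts.parquet")
    ```

    Alternatively, calling df.repartition(100) after the read forces the partition count directly, at the cost of a shuffle.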
