Spark - How to do computation on N partitions and then write to 1 file

清酒与你 2021-01-21 15:48

I would like to do a computation on many partitions, to benefit from the parallelism, and then write my results to a single file, probably a Parquet file. The workflow I tried was to map over the DataFrame and then call coalesce(1) before writing, but that forced the whole computation onto a single partition.
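
For concreteness, here is a minimal sketch of the kind of workflow being described, assuming a simple UDF stands in for the per-row computation (the paths, the `value` column, and `expensive_fn` are illustrative placeholders, not from the original post):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("many-partitions-one-file").getOrCreate()

# Read input that is split across many partitions.
df = spark.read.parquet("/path/to/input")

# Stand-in for the expensive per-row computation.
expensive_fn = udf(lambda v: float(v) * 2.0, DoubleType())
mapped_df = df.withColumn("result", expensive_fn(df["value"]))

# Collapsing to one partition before the write is where the
# parallelism is lost -- see the answer below.
mapped_df.coalesce(1).write.parquet("/path/to/output")
```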

2 Answers
  •  既然无缘
    2021-01-21 16:22

    You want to use `repartition(1)` instead of `coalesce(1)`. The issue is that `repartition` will happily do shuffling to accomplish its ends, while `coalesce` will not.

    `coalesce` is much more efficient than `repartition`, but it has to be used carefully, or parallelism will end up being severely constrained, as you have experienced. All the partitions that `coalesce` merges into a particular result partition have to reside on the same node. The `coalesce(1)` call demands a single result partition, so all partitions of `mapped_df` need to reside on a single node. To make that true, Spark shoehorns `mapped_df` into a single partition.
