I would like to do a computation on many partitions, to benefit from the parallelism, and then write my results to a single file, probably a Parquet file. The workflow I tried is to run my computation over a DataFrame with many partitions and then call "coalesce(1)" before writing, but instead of running in parallel, the whole computation ends up running on a single partition.
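Here is a minimal, self-contained sketch of what I tried; "spark.range" and the square-root column are stand-ins for my real input and per-row computation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Fan the data out over many partitions so the computation can run in parallel.
df = spark.range(1_000_000).repartition(100)

# Stand-in for the real, expensive per-row computation.
mapped_df = df.withColumn("result", F.sqrt(F.col("id")))

# Collapse to one partition so the output is a single Parquet file.
# With this line in place, the computation runs as a single task.
mapped_df.coalesce(1).write.mode("overwrite").parquet("/tmp/output")
```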
You want to use "repartition(1)" instead of "coalesce(1)". The issue is that "repartition" will happily shuffle data to accomplish its ends, while "coalesce" will not.
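With the sketch from the question, that means changing only the final line. "repartition(1)" still produces a single output file, but it inserts a shuffle boundary, so the map above it keeps running across all 100 partitions:

```python
# Shuffle the 100 computed partitions down to 1 only at write time;
# the expensive map still executes in 100 parallel tasks.
mapped_df.repartition(1).write.mode("overwrite").parquet("/tmp/output")
```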
"Coalesce" is much more efficient than "repartition", but has to be used carefully, or parallelism will end up being severely constrained as you have experienced. All the partitions "coalesce" merges into a particular result partition have to reside on the same node. The "coalesce(1)" call demands a single result partition, so all partitions of "mapped_df" need to reside on a single node. To make that true, Spark shoehorns "mapped_df" into a single partition.