Spark saveAsTextFile() writes to multiple files instead of one [duplicate]


Question


I am currently using Spark and Scala on my laptop.

When I write an RDD to a file, the output is written to two files, "part-00000" and "part-00001". How can I force Spark / Scala to write to a single file?

My code is currently:

myRDD.map(x => x._1 + "," + x._2).saveAsTextFile("/path/to/output")

where the map removes the tuple parentheses so that each pair is written as key,value.


Answer 1:


The "problem" is indeed a feature, and it is produced by how your RDD is partitioned, hence it is separated in n parts where n is the number of partitions. To fix this you just need to change the number of partitions to one, by using repartition on your RDD. The documentation states:

repartition(numPartitions)

Return a new RDD that has exactly numPartitions partitions.

Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.

For example, this change should work:

myRDD.map(x => x._1 + "," + x._2).repartition(1).saveAsTextFile("/path/to/output")
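You can verify the effect by checking the partition count before and after; getNumPartitions is a standard RDD method, and the intermediate val below is added only for illustration:

val formatted = myRDD.map(x => x._1 + "," + x._2)
formatted.getNumPartitions                   // e.g. 2 in the question's setup
formatted.repartition(1).getNumPartitions    // 1 after repartitioning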

As the documentation says, you can also use coalesce, which is the recommended option when reducing the number of partitions because it can avoid a shuffle. Keep in mind, however, that reducing the number of partitions to one is generally a bad idea: it moves all of the data onto a single node and gives up parallelism.
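For completeness, a minimal sketch of the coalesce variant, using the same myRDD and output path from the question:

// coalesce(1) can avoid a full shuffle when shrinking the partition count,
// but the single task that writes the output still receives all of the data.
myRDD.map(x => x._1 + "," + x._2).coalesce(1).saveAsTextFile("/path/to/output")

Either way, note that saveAsTextFile produces a directory containing a single part-00000 file (plus a _SUCCESS marker), not a bare file at the given path.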



Source: https://stackoverflow.com/questions/35445486/spark-saveastextfile-writes-to-multiple-files-instead-of-one
