Spark How to Specify Number of Resulting Files for DataFrame While/After Writing

Submitted by 蓝咒 on 2021-02-17 05:25:06

Question


I have seen several Q&As about writing a single file to HDFS; it seems that using coalesce(1) is sufficient.

E.g.:

df.coalesce(1).write.mode("overwrite").format(format).save(location)

But how can I specify the "exact" number of files that will be written after the save operation?

So my questions are:

If I have a DataFrame consisting of 100 partitions, will a write operation produce 100 files?

If I have a DataFrame consisting of 100 partitions and I call repartition(50)/coalesce(50) before writing, will it write 50 files?

Is there a way in Spark to specify the resulting number of files when writing a DataFrame to HDFS?

Thanks


Answer 1:


The number of output files is in general equal to the number of writing tasks (partitions). Under normal conditions it cannot be smaller (each writer writes its own part, and multiple tasks cannot write to the same file), but it can be larger if the format has non-standard behavior or partitionBy is used.

Normally

If I have a DataFrame consisting of 100 partitions, will a write operation produce 100 files?

Yes

If I have a DataFrame consisting of 100 partitions and I call repartition(50)/coalesce(50) before writing, will it write 50 files?

And yes.

Is there a way in Spark to specify the resulting number of files when writing a DataFrame to HDFS?

No.



Source: https://stackoverflow.com/questions/51098198/spark-how-to-specify-number-of-resulting-files-for-dataframe-while-after-writing
