I am using https://github.com/databricks/spark-csv and I am trying to write a single CSV file, but I am not able to: it creates a folder instead. I need a Scala function that takes a path and file name and writes a single CSV file there.
I might be a little late to the game here, but using coalesce(1) or repartition(1) may work for small datasets. With a large dataset, though, all of the data is pulled into a single partition on one node, which is likely to throw OOM errors or, at best, process slowly.
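For context, this is roughly what that approach looks like with spark-csv (a minimal sketch; the app name and the input and output paths are assumptions for illustration):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SingleCsvExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("single-csv"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/tmp/input.csv")            // hypothetical input path

    df.coalesce(1)                       // all rows end up in one partition
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/tmp/output-folder")        // still a folder, but with a single part file
  }
}
```

Note that even with coalesce(1) the output is still a folder; it simply contains one part file instead of many.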
I would highly suggest that you use the FileUtil.copyMerge() function from the Hadoop API instead. This merges all the part files in the output folder into a single file.
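A minimal sketch of that approach on Hadoop 2.x; the function name, paths, and the df parameter are assumptions for illustration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.DataFrame

// Write the part files as usual, then merge them into one CSV with
// FileUtil.copyMerge() (Hadoop 2.x only; removed in 3.0, see EDIT 2 below).
def writeSingleCsv(df: DataFrame, tmpFolder: String, destFile: String,
                   hadoopConf: Configuration): Unit = {
  df.write
    .format("com.databricks.spark.csv")
    .save(tmpFolder)                     // spark-csv writes a folder of part files

  val fs = FileSystem.get(hadoopConf)
  // Concatenate every file under tmpFolder into destFile; `true` deletes the
  // source folder afterwards, and `null` adds nothing between merged files.
  FileUtil.copyMerge(fs, new Path(tmpFolder), fs, new Path(destFile),
                     true, hadoopConf, null)
}
```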
EDIT: This effectively brings the data to the driver rather than to an executor node, so coalesce(1) would be fine if a single executor has more RAM available than the driver.
EDIT 2: copyMerge() was removed in Hadoop 3.0. See the following Stack Overflow question for more information on how to work with the newest version: How to do CopyMerge in Hadoop 3.0?