Write single CSV file using spark-csv

前端 未结 13 2076
心在旅途
心在旅途 2020-11-22 08:43

I am using https://github.com/databricks/spark-csv , I am trying to write a single CSV, but not able to, it is making a folder.

Need a Scala function which will take

13条回答
  •  执念已碎
    2020-11-22 09:23

    I might be a little late to the game here, but using coalesce(1) or repartition(1) may work for small data-sets, but large data-sets would all be thrown into one partition on one node. This is likely to throw OOM errors, or at best, to process slowly.

    I would highly suggest that you use the FileUtil.copyMerge() function from the Hadoop API. This will merge the outputs into a single file.

    EDIT - This effectively brings the data to the driver rather than an executor node. Coalesce() would be fine if a single executor has more RAM for use than the driver.

    EDIT 2: copyMerge() is being removed in Hadoop 3.0. See the following stack overflow article for more information on how to work with the newest version: How to do CopyMerge in Hadoop 3.0?

提交回复
热议问题