I am using https://github.com/databricks/spark-csv and I am trying to write a single CSV file, but I am not able to: it creates a folder instead. I need a Scala function that takes a path and file name and writes a single CSV file there.
I might be a little late to the game here, but using coalesce(1) or repartition(1) may work for small datasets. With a large dataset, though, all of the data is pulled into a single partition on one node, which is likely to throw OOM errors or, at best, process slowly.
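For context, this is roughly what that approach looks like with spark-csv (a minimal sketch; the app name and the input and output paths are assumptions for illustration):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SingleCsvExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("single-csv"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/tmp/input.csv")            // hypothetical input path

    df.coalesce(1)                       // all rows end up in one partition
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/tmp/output-folder")        // still a folder, but with a single part file
  }
}
```

Note that even with coalesce(1) the output is still a folder; it simply contains one part file instead of many.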
I would highly suggest that you use the FileUtil.copyMerge() function from the Hadoop API instead. This merges all the part files in the output folder into a single file.
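A minimal sketch of that approach on Hadoop 2.x; the function name, paths, and the df parameter are assumptions for illustration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.DataFrame

// Write the part files as usual, then merge them into one CSV with
// FileUtil.copyMerge() (Hadoop 2.x only; removed in 3.0, see EDIT 2 below).
def writeSingleCsv(df: DataFrame, tmpFolder: String, destFile: String,
                   hadoopConf: Configuration): Unit = {
  df.write
    .format("com.databricks.spark.csv")
    .save(tmpFolder)                     // spark-csv writes a folder of part files

  val fs = FileSystem.get(hadoopConf)
  // Concatenate every file under tmpFolder into destFile; `true` deletes the
  // source folder afterwards, and `null` adds nothing between merged files.
  FileUtil.copyMerge(fs, new Path(tmpFolder), fs, new Path(destFile),
                     true, hadoopConf, null)
}
```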
EDIT: This effectively brings the data to the driver rather than to an executor node, so coalesce(1) would be fine if a single executor has more RAM available than the driver.
EDIT 2: copyMerge() was removed in Hadoop 3.0. See the following Stack Overflow question for more information on how to work with the newest version: How to do CopyMerge in Hadoop 3.0?