Write single CSV file using spark-csv

Submitted anonymously (unverified) on 2019-12-03 01:38:01

Question:

I am using https://github.com/databricks/spark-csv and trying to write a single CSV file, but I can't: it creates a folder instead.

I need a Scala function that takes parameters such as a path and file name and writes a single CSV file there.

Answer 1:

It is creating a folder with multiple files because each partition is saved individually. If you need a single output file (still inside a folder), you can repartition (preferred if the upstream data is large, but it requires a shuffle):

df.repartition(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("mydata.csv")

or coalesce:

df.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("mydata.csv")

Either way the data frame ends up in a single partition before saving, so all data will be written to mydata.csv/part-00000. Before you use this option, be sure you understand what is going on and what the cost of transferring all data to a single worker is. If you use a distributed file system with replication, the data will be transferred multiple times: first fetched to a single worker and subsequently distributed over the storage nodes.

Alternatively, you can leave your code as it is and use general-purpose tools like cat or HDFS getmerge to simply merge all the parts afterwards.
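If you want to drive getmerge from Scala rather than the command line, something along these lines should work (a rough, untested sketch using Hadoop's FsShell; the paths are illustrative, and note that -getmerge writes the merged file to the local filesystem):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FsShell

val conf  = new Configuration()
val shell = new FsShell(conf)
// merge every part file under mydata.csv/ on HDFS into one file on local disk
shell.run(Array("-getmerge", "mydata.csv", "/tmp/mydata.csv"))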



Answer 2:

I might be a little late to the game here, but using coalesce(1) or repartition(1) may work for small data sets; with a large data set, all of the data would be thrown into one partition on one node. This is likely to throw OOM errors or, at best, to process slowly.

I would highly suggest that you use the FileUtil.copyMerge() function from the Hadoop API. This will merge the outputs into a single file.

EDIT - This effectively brings the data to the driver rather than an executor node. Coalesce() would be fine if a single executor has more RAM for use than the driver.

EDIT 2: copyMerge() is being removed in Hadoop 3.0. See the following Stack Overflow question for more information on how to work with the newest version: Hadoop how to do CopyMerge in Hadoop 3.0
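For Hadoop 3, one way sketched in that thread is to concatenate the part files yourself with the plain FileSystem API. A minimal, untested sketch (the function name copyMerge3 and the "part-" filtering are my own choices, not a Hadoop API):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

def copyMerge3(srcDir: String, dstFile: String): Unit = {
  val conf = new Configuration()
  val fs   = FileSystem.get(conf)
  val out  = fs.create(new Path(dstFile))             // single destination file
  try {
    fs.listStatus(new Path(srcDir))
      .filter(_.getPath.getName.startsWith("part-"))  // skip _SUCCESS and friends
      .sortBy(_.getPath.getName)                      // keep the part order stable
      .foreach { status =>
        val in = fs.open(status.getPath)
        try IOUtils.copyBytes(in, out, conf, false)   // false: don't close the streams here
        finally in.close()
      }
  } finally out.close()
}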



Answer 3:

If you are running Spark on HDFS, I've been solving the problem by writing the CSV files normally and leveraging HDFS to do the merging. I do that directly in Spark (1.6):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
  // the "true" setting deletes the source files once they are merged into the new output
}

val newData = ???  // create your DataFrame here

val outputfile = "/user/feeds/project/outputs/subject"
var filename = "myinsights"
var outputFileName = outputfile + "/temp_" + filename
var mergedFileName = outputfile + "/merged_" + filename
var mergeFindGlob  = outputFileName

newData.write
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .mode("overwrite")
  .save(outputFileName)
merge(mergeFindGlob, mergedFileName)
newData.unpersist()

Can't remember where I learned this trick, but it might work for you.



Answer 4:

If you are using Databricks and can fit all the data into RAM on one worker (and thus can use .coalesce(1)), you can use dbfs to find and move the resulting CSV file:

val fileprefix = "/mnt/aws/path/file-prefix"

dataset
  .coalesce(1)
  .write
  //.mode("overwrite") // I usually don't use this, but you may want to.
  .option("header", "true")
  .option("delimiter", "\t")
  .csv(fileprefix + ".tmp")

val partition_path = dbutils.fs.ls(fileprefix + ".tmp/")
  .filter(file => file.name.endsWith(".csv"))(0).path

dbutils.fs.cp(partition_path, fileprefix + ".tab")

dbutils.fs.rm(fileprefix + ".tmp", recurse = true)

If your file does not fit into RAM on the worker, you may want to consider chaotic3quilibrium's suggestion to use FileUtils.copyMerge(). I have not done this, and don't yet know whether it is possible, e.g., on S3.

This answer is built on previous answers to this question as well as my own tests of the provided code snippet. I originally posted it to Databricks and am republishing it here.

The best documentation for dbfs's rm's recursive option I have found is on a Databricks forum.



Answer 5:

Repartition/coalesce to 1 partition before you save (you'd still get a folder, but it would have a single part file in it).
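A minimal sketch of that using the CSV writer built into Spark 2.x (the output path here is just an example):

df.repartition(1)
  .write
  .option("header", "true")
  .csv("/tmp/output_dir")   // yields /tmp/output_dir/part-....csv plus _SUCCESS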



Answer 6:

You can use rdd.coalesce(1, true).saveAsTextFile(path)

It will store the data as a single file at path/part-00000.
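A slightly fuller sketch of that approach starting from a DataFrame, with naive CSV formatting (no quoting, escaping, or header handling; the output path is illustrative):

val lines = df.rdd.map(_.mkString(","))   // Row -> comma-joined string, no quoting
lines
  .coalesce(1, shuffle = true)
  .saveAsTextFile("/tmp/mydata_txt")      // single file at /tmp/mydata_txt/part-00000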



Answer 7:

There is one more way, using plain Java I/O:

import java.io._

def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit) {
  val p = new java.io.PrintWriter(f)
  try { op(p) }
  finally { p.close() }
}

printToFile(new File("C:/TEMP/df.csv")) { p => df.collect().foreach(p.println) }


Answer 8:

You can try doing this:

df.coalesce(1)
  .write.option("header", "true")
  .csv("/path/new_folder")

The CSV file will be created inside new_folder.


