How to save a DataFrame as compressed (gzipped) CSV?

Asked by 感情败类 on 2020-12-30 23:09 · 4 answers · 1940 views

I use Spark 1.6.0 and Scala.

I want to save a DataFrame as compressed CSV format.

Here is what I have so far (assume I already have df and …

4 Answers
  •  無奈伤痛
    2020-12-30 23:47

    To write the CSV file with a header and rename the part-000 file to .csv.gzip:

    DF.coalesce(1).write.format("com.databricks.spark.csv")
      .mode("overwrite")
      .option("header", "true")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .save(tempLocationFileName)
    
    copyRename(tempLocationFileName, finalLocationFileName)
    
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
    
    def copyRename(srcPath: String, dstPath: String): Unit = {
      val hadoopConfig = new Configuration()
      val hdfs = FileSystem.get(hadoopConfig)
      // the "true" flag deletes the source files once they are merged into the new output
      FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
    }
    

    If you don't need the header, set it to false; then you won't need the coalesce either, and the write will be faster.
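
    On Spark 2.x and later the CSV writer is built in, so the spark-csv package and the copy/rename step are unnecessary if a directory of gzipped part files is acceptable. A minimal sketch, assuming a Spark 2.x+ `df` and a placeholder output path:

    // Sketch for Spark 2.x+: the native CSV writer supports a "compression"
    // option directly. "/tmp/output-dir" is a placeholder path; Spark writes
    // one or more part-*.csv.gz files into that directory.
    df.write
      .mode("overwrite")
      .option("header", "true")
      .option("compression", "gzip")
      .csv("/tmp/output-dir")

    As with the Spark 1.6 version, add `.coalesce(1)` before `.write` if you need a single output file, at the cost of funnelling all data through one task.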
