How to save a DataFrame as compressed (gzipped) CSV?


Question


I use Spark 1.6.0 and Scala.

I want to save a DataFrame as compressed CSV format.

Here is what I have so far (assume I already have df as a DataFrame and sc as the SparkContext):

//set the conf to the codec I want
sc.getConf.set("spark.hadoop.mapred.output.compress", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
sc.getConf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")

df.write
  .format("com.databricks.spark.csv")
  .save(my_directory)

The output is not in gz format.


Answer 1:


On the spark-csv GitHub page: https://github.com/databricks/spark-csv

One can read:

codec: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of case-insensitive shorten names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified.

In your case, this should work:

df.write.format("com.databricks.spark.csv")
  .codec("gzip")
  .save("my_directory/my_file.gzip")
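
The snippet above keeps the answer's .codec(...) shorthand. Per the documentation quoted above, the codec can also be passed as an ordinary writer option (this is the form Answer 2 below uses). A minimal Scala sketch, assuming the spark-csv package is on the classpath and df is the DataFrame from the question:

df.write
  .format("com.databricks.spark.csv")
  .option("codec", "gzip")
  .save(my_directory)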




Answer 2:


This code works for Spark 2.1, where .codec is not available:

df.write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(my_directory)

For Spark 2.2, you can use the df.write.csv(..., compression="gzip") option of the built-in CSV writer described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec
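
In the Scala API, csv() takes no keyword arguments; a minimal sketch of the equivalent call (assuming Spark 2.x and the built-in CSV writer) passes the codec through an option:

df.write
  .option("compression", "gzip")
  .csv(my_directory)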




Answer 3:


With Spark 2.0+, this has become a bit simpler:

df.write.csv("path", compression="gzip")

You don't need the external Databricks CSV package anymore.

The csv() writer supports a number of handy options. For example:

  • sep: To set the separator character.
  • quote: To set the quote character.
  • header: Whether to include a header line.

There are also a number of other compression codecs you can use, in addition to gzip:

  • bzip2
  • lz4
  • snappy
  • deflate

The full Spark docs for the csv() writer are here: Python / Scala
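
Putting these together, here is a minimal Scala sketch (assuming Spark 2.0+ and its built-in CSV writer) that combines a couple of the options above with an alternative codec:

df.write
  .option("header", "true")        // write a header line
  .option("sep", ";")              // use a semicolon as the separator
  .option("compression", "bzip2")  // any of the codecs listed above
  .csv(my_directory)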




Answer 4:


To write the CSV file with a header and rename the part file to a single .csv.gzip file:

DF.coalesce(1).write.format("com.databricks.spark.csv")
  .mode("overwrite")
  .option("header", "true")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(tempLocationFileName)

copyRename(tempLocationFileName, finalLocationFileName)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

def copyRename(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // copyMerge concatenates all part files under srcPath into the single file dstPath;
  // the "true" argument deletes the source files once they are merged into the new output
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
}

If you don't need the header, set it to false; then you won't need the coalesce either, and the write will be faster.



Source: https://stackoverflow.com/questions/40163996/how-to-save-a-dataframe-as-compressed-gzipped-csv
