Spark - How to write a single csv file WITHOUT folder?

Submitted by 人盡茶涼 on 2019-11-30 06:48:39
Paul Vbl

A possible solution is to convert the Spark DataFrame to a pandas DataFrame and save it as a CSV:

df.toPandas().to_csv("<path>/<filename>")
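Note that toPandas() collects the entire DataFrame into driver memory, so this only works when the data fits on the driver. Also, pandas writes the row index as an extra column by default; a minimal variant that suppresses it (index is a standard to_csv parameter):

    df.toPandas().to_csv("<path>/<filename>", index=False)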
Ravi

There is no Spark DataFrame API that writes/creates a single file instead of a directory as the result of a write operation.

Both options below will create one single data file inside the directory, along with the standard marker files (_SUCCESS, _committed, _started).

1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")

2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")

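Even with coalesce(1) or repartition(1), the result is still a directory containing one part file. If you need a single named file (which is what the question asks), a common follow-up step, not part of the original answer, is to move the lone part file out and delete the directory. A minimal sketch, assuming the output is on the local filesystem; move_single_csv is an illustrative name, not a Spark API:

    import glob
    import os
    import shutil

    def move_single_csv(folder, dest_file):
        # With coalesce(1)/repartition(1) there is exactly one data file.
        part = glob.glob(os.path.join(folder, "part-*"))[0]
        shutil.move(part, dest_file)
        shutil.rmtree(folder)  # discard _SUCCESS and the other marker files

For example, move_single_csv("PATH/FOLDER_NAME/x.csv", "PATH/x.csv") after either write above.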
If you don't use coalesce(1) or repartition(1), and instead take advantage of Spark's parallelism when writing files, it will create multiple data files inside the directory.

You then need to write a function in the driver that combines all the part files into a single file (e.g. cat part-00000* > singlefilename) once the write operation is done; see the sketch after the example below.

df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
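A minimal driver-side sketch of that merge step, assuming the output lives on the local filesystem (on HDFS or S3 you would use the corresponding filesystem API instead); merge_csv_parts is an illustrative name, not a Spark API:

    import glob
    import os
    import shutil

    def merge_csv_parts(folder, output_file):
        # Concatenate every part file Spark produced into one CSV,
        # skipping the repeated header row in every part after the first.
        part_files = sorted(glob.glob(os.path.join(folder, "part-*")))
        with open(output_file, "w") as out:
            for i, part in enumerate(part_files):
                with open(part) as src:
                    if i > 0:
                        next(src, None)  # drop the duplicate header line
                    shutil.copyfileobj(src, out)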

You can use this approach, and if you don't want to supply the CSV file name every time, you can write a helper function or build an array of CSV file names and pass them to it.
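One way to read that suggestion (a sketch only; df_sales, df_customers, and the file names are illustrative, and merge_csv_parts is the helper sketched above):

    # Hypothetical: pair each DataFrame with a target CSV name and reuse the helper.
    outputs = [(df_sales, "sales.csv"), (df_customers, "customers.csv")]
    for frame, name in outputs:
        folder = "PATH/FOLDER_NAME/" + name + "_dir"
        frame.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv(folder)
        merge_csv_parts(folder, "PATH/FOLDER_NAME/" + name)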
