Spark - How to write a single csv file WITHOUT folder?

Submitted by 限于喜欢 on 2019-12-18 12:12:02

Question


Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is

df.coalesce(1).write.option("header", "true").csv("name.csv")

This will write the dataframe into a CSV file contained in a folder called name.csv, but the actual CSV file will have a name like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
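
For reference, the output on disk typically looks like this (the part-file name varies from run to run):

name.csv/
    _SUCCESS
    part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv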

I would like to know whether it is possible to avoid the name.csv folder and have the actual CSV file named name.csv rather than part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files that I will later read together in Python, and my Python code relies on the actual CSV names and also needs all the CSV files in a single folder (not a folder of folders).

Any help is appreciated.


Answer 1:


A possible solution is to convert the Spark dataframe to a pandas dataframe and save it as CSV:

df.toPandas().to_csv("<path>/<filename>")
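
Note that toPandas() collects the entire dataset into driver memory, so this only works when the data fits on a single machine. A minimal sketch with a hypothetical output path; index=False stops pandas from writing an extra index column into the CSV:

# Collects the whole DataFrame onto the driver; the data must fit in memory.
# index=False prevents pandas from adding an index column to the CSV.
df.toPandas().to_csv("output/name.csv", index=False)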



Answer 2:


There is no Spark DataFrame API that writes a single file instead of a directory as the result of a write operation.

Both options below will create one single data file inside the directory, along with the standard marker files (_SUCCESS, _committed, _started).

 1. df.coalesce(1).write.mode("overwrite").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")

 2. df.repartition(1).write.mode("overwrite").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")

In Spark 2+, .csv() already selects the built-in CSV source, so the legacy com.databricks.spark.csv format string is unnecessary.

If you don't use coalesce(1) or repartition(1) and instead take advantage of Spark's parallelism for writing, it will create one data file per partition inside the directory.

Once the write operation is done, you need a function on the driver that combines all the part files into a single file (e.g. cat part-00000* > singlefilename), as sketched below.
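
A minimal sketch of that driver-side step in Python, assuming the output lands on the local filesystem (the function name, temp directory, and paths are hypothetical; on HDFS or S3 you would use the Hadoop FileSystem API or boto3 instead):

import glob
import shutil

def write_single_csv(df, final_name, tmp_dir="_tmp_csv_out"):
    # Produce exactly one part file inside tmp_dir
    df.coalesce(1).write.mode("overwrite").option("header", "true").csv(tmp_dir)
    # Locate the single part file and move it under the desired name
    part_file = glob.glob(tmp_dir + "/part-*.csv")[0]
    shutil.move(part_file, final_name)
    # Remove the leftover directory and its _SUCCESS marker files
    shutil.rmtree(tmp_dir)

write_single_csv(df, "name.csv")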




Answer 3:


I had the same problem and solved it with Python's tempfile module: write the CSV into a temporary directory, then upload the single part file to S3 under the desired name. (NamedTemporaryFile creates a file, but Spark's save() needs a directory path, so TemporaryDirectory is the right tool here.)

import glob
from tempfile import TemporaryDirectory

import boto3

s3 = boto3.resource('s3')

with TemporaryDirectory() as tmp_dir:
    # Spark writes a directory of part files; overwrite the empty temp dir
    df.coalesce(1).write.mode('overwrite').format('csv').options(header=True).save(tmp_dir)
    # Pick out the single part file and upload it under the desired key
    part_file = glob.glob(tmp_dir + '/part-*.csv')[0]
    s3.meta.client.upload_file(part_file, S3_BUCKET, S3_FOLDER + 'name.csv')

See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file().




Answer 4:


df.write.mode("overwrite").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")

You can use this, and if you don't want to spell out the CSV name every time, you can write a small helper or build an array of the CSV file names and loop over it; see the sketch below.
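
A minimal sketch of that idea, assuming a hypothetical mapping from output names to dataframes:

# Hypothetical mapping of output names to DataFrames
outputs = {"sales": sales_df, "users": users_df}

for name, frame in outputs.items():
    frame.write.mode("overwrite").option("header", "true").csv("PATH/FOLDER_NAME/" + name + ".csv")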



Source: https://stackoverflow.com/questions/43661660/spark-how-to-write-a-single-csv-file-without-folder
