How to export a table dataframe in PySpark to csv?


If the data frame fits in driver memory and you want to save it to the local file system, you can convert the Spark DataFrame to a local Pandas DataFrame using the toPandas method and then simply use to_csv:

df.toPandas().to_csv('mycsv.csv')
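Note that by default pandas also writes the row index as the first column; pass index=False if you don't want it in the output file:

df.toPandas().to_csv('mycsv.csv', index=False)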

Otherwise you can use spark-csv (a header-option example follows the list):

  • Spark 1.3

    df.save('mycsv.csv', 'com.databricks.spark.csv')
    
  • Spark 1.4+

    df.write.format('com.databricks.spark.csv').save('mycsv.csv')
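
In either version, spark-csv accepts writer options; for example, to include a header row on Spark 1.4+:

df.write.format('com.databricks.spark.csv').option('header', 'true').save('mycsv.csv')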
    

In Spark 2.0+ you can use csv data source directly:

df.write.csv('mycsv.csv')
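
The csv writer takes the usual options as keyword arguments; for example, to write a header row and overwrite any existing output:

df.write.csv('mycsv.csv', header=True, mode='overwrite')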

For Apache Spark 2+, to save the dataframe into a single csv file, use the following command:

query.repartition(1).write.csv("cc_out.csv", sep='|')

Here 1 indicates that I need only one partition (and hence a single csv part file); you can change it according to your requirements.
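Keep in mind that Spark still writes a directory named cc_out.csv containing a single part file, not a bare file. If you really need one flat file, you can move the part file out afterwards (a minimal sketch; the destination file name here is just an example):

import glob
import shutil

# Spark wrote a directory; locate its single part file and rename it
part_file = glob.glob("cc_out.csv/part-*")[0]
shutil.move(part_file, "cc_out_single.csv")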

If you cannot use spark-csv, you can do the following:

df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")

That will not work if you need to handle strings containing line breaks or commas, because those fields must be quoted. Use this instead:

import csv
import io

def row2csv(row):
    # csv.writer takes care of quoting fields that contain commas or newlines
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([str(s) for s in row])
    return buffer.getvalue().strip()

df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")

You need to repartition the DataFrame into a single partition and then define the format, path and other parameters for the file in Unix file-system format, and here you go:

df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')

Read more about the repartition and save functions in the Spark documentation.

However, repartition is a costly function and toPandas() is the worst of all. Try using .coalesce(1) instead of .repartition(1) in the previous syntax for better performance.
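For example, the coalesce variant of the command above:

df.coalesce(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')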

Read more on repartition vs coalesce functions.

How about this (if you don't want a one-liner)?

with open("output.tsv", "w") as f:  # output path is illustrative
    for row in df.collect():        # collect() pulls every row to the driver
        d = row.asDict()
        # the column names are examples; adjust them to your schema
        s = "%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"])
        f.write(s)

The output file is opened explicitly here, and since collect() brings the entire DataFrame to the driver, this only suits small data. The separator is a TAB char, but it's easy to change to whatever you want.
