How to save a spark DataFrame as csv on disk?

你的背包 2020-11-29 03:05

For example, the result of this:

df.filter("project = 'en'").select("title","count").groupBy("title").sum()

would return a DataFrame. How can I save that DataFrame as a CSV file on disk?

4 Answers
  •  無奈伤痛
    2020-11-29 03:42

    Apache Spark before 2.x does not support native CSV output on disk.

    You have four options, though:

    1. You can convert your DataFrame into an RDD:

      import org.apache.spark.sql.Row

      def convertToReadableString(r: Row): String = r.mkString(",")  // naive: joins fields with commas
      df.rdd.map(convertToReadableString).saveAsTextFile(filepath)
      

      This will create a folder at filepath. Under that path you'll find one file per partition (e.g. part-000*). Note that mkString(",") applies no CSV quoting or escaping; see the quote-aware sketch below.

      What I usually do when I want to concatenate all the partitions into one big CSV is

      cat filepath/part* > mycsvfile.csv
      

      Some will use coalesce(1, false) to squeeze the RDD into a single partition. That is usually bad practice, since it funnels all of the data through a single task and may overwhelm the one executor that has to hold it.

      Note that df.rdd will return an RDD[Row].

    2. With Spark < 2.x, you can use the Databricks spark-csv library:

      • Spark 1.4+:

        df.write.format("com.databricks.spark.csv").save(filepath)
        
      • Spark 1.3:

        df.save(filepath,"com.databricks.spark.csv")
        
    3. With Spark 2.x the spark-csv package is not needed, as a CSV writer is built into Spark (see the header/single-file sketch below).

      df.write.format("csv").save(filepath)
      
    4. You can convert the DataFrame to a local pandas DataFrame with toPandas() and use its to_csv method (PySpark only).

    Note: Solutions 1, 2 and 3 will result in CSV format files (part-*) generated by the underlying Hadoop API that Spark calls when you invoke save. You will have one part- file per partition.
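
    Following up on solution 1: mkString(",") breaks as soon as a field itself contains a comma or a double quote. Here is a minimal quote-aware sketch; toCsvLine is a hypothetical helper name, and df and filepath are the same placeholders as above:

      import org.apache.spark.sql.Row

      // Quote every field and escape embedded double quotes by doubling
      // them (RFC 4180 style). Nulls become empty fields.
      def toCsvLine(r: Row): String =
        (0 until r.length).map { i =>
          val v = if (r.isNullAt(i)) "" else r.get(i).toString
          "\"" + v.replace("\"", "\"\"") + "\""
        }.mkString(",")

      df.rdd.map(toCsvLine).saveAsTextFile(filepath)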
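
    And on solution 3: the built-in writer accepts options, so you can emit a header row and, when the result is small enough, coalesce down to a single part file. A sketch, with the same caveat about coalesce as in solution 1:

      // Spark 2.x: header row, overwrite semantics, one part file.
      // coalesce(1) funnels everything through a single task, so only
      // use it when the output fits comfortably on one executor.
      df.coalesce(1)
        .write
        .option("header", "true")
        .mode("overwrite")
        .csv(filepath)

    Even with coalesce(1) the output is still a directory containing a single part-*.csv file; rename or cat it if you need one named file.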
