How to write standard CSV

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-11 16:30:36

Question


It is very simple to read a standard CSV file, for example:

 val t = spark.read.format("csv")
 .option("inferSchema", "true")
 .option("header", "true")
 .load("file:///home/xyz/user/t.csv")

It reads a real CSV file, something like

   fieldName1,fieldName2,fieldName3
   aaa,bbb,ccc
   zzz,yyy,xxx

and t.show produces the expected result.

I need the inverse: to write a standard CSV file (not a directory of non-standard files).

It is very frustrating not to get the inverse result when write is used. Maybe some other option, or some kind of format ("REAL CSV, please!"), exists.


NOTES

I am using Spark v2.2 and running tests in the spark-shell.

The "syntactic inverse" of read is write, so it is expected to produce the same file format. But the result of

   t.write.format("csv").option("header", "true").save("file:///home/xyz/user/t-writed.csv")

is not a CSV file in RFC 4180 standard format like the original t.csv, but a folder t-writed.csv/ containing the files part-00000-66b020ca-2a16-41d9-ae0a-a6a8144c7dbc-c000.csv.deflate and _SUCCESS, which looks like a "Parquet", "ORC", or some other format.
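(Those part files are actually CSV; the .deflate suffix only indicates deflate compression, which can be turned off with .option("compression", "none"). With uncompressed parts, the directory Spark writes can be merged into one standard CSV file outside Spark. Below is a minimal stdlib-Python sketch of that merge; the function name merge_spark_csv and the part-*.csv glob pattern are illustrative assumptions, and it assumes each part was written with header=true, so only the first header is kept.)

```python
import glob
import os
import shutil

def merge_spark_csv(spark_out_dir, target_file):
    """Merge the part-*.csv files Spark wrote into one CSV, keeping one header."""
    parts = sorted(glob.glob(os.path.join(spark_out_dir, "part-*.csv")))
    with open(target_file, "w", newline="") as out:
        for i, part in enumerate(parts):
            with open(part, newline="") as f:
                header = f.readline()      # each part starts with its own header
                if i == 0:
                    out.write(header)      # keep the header from the first part only
                shutil.copyfileobj(f, out) # copy the remaining data rows verbatim
```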

Any language with a complete toolkit that can "read something" should be able to "write that something"; it is a kind of orthogonality principle.

Similar questions that did not solve it

Similar questions or links that did not solve the problem, perhaps because they used an incompatible Spark version, or perhaps because of a spark-shell limitation. They have good clues for experts:

  • This similar question, pointed out by @JochemKuijpers: I tried the suggestion but obtained the same ugly result.

  • This link says that there is a solution (!), but I can't copy/paste saveDfToCsv() into my spark-shell ("error: not found: type DataFrame"); any clue?


Answer 1:


If the dataframe is not too large, you can try:

df.toPandas().to_csv(path)

If the dataframe is large, you may get out-of-memory errors or "too many open files" errors.
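(What this approach buys you is an ordinary single-file CSV write on the driver, the same kind of write to_csv performs. For reference, Python's stdlib csv module does the equivalent RFC 4180-style write directly; the data below mirrors the t.csv sample from the question, and the filename t-single.csv is illustrative.)

```python
import csv

rows = [
    ["fieldName1", "fieldName2", "fieldName3"],
    ["aaa", "bbb", "ccc"],
    ["zzz", "yyy", "xxx"],
]

with open("t-single.csv", "w", newline="") as f:
    writer = csv.writer(f)  # quotes fields containing commas or quotes, per RFC 4180
    writer.writerows(rows)  # one plain file, not a directory of part files
```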




Answer 2:


If you're using Spark because you're working with "big"* datasets, you probably don't want to do anything like coalesce(1) or toPandas(), since that will most likely crash your driver (the whole dataset has to fit in the driver's RAM, which it usually does not).

On the other hand: If your data does fit into the RAM of a single machine - why are you torturing yourself with distributed computing?

*Definitions vary. My personal one is "does not fit in an Excel sheet".



Source: https://stackoverflow.com/questions/58142220/how-to-write-standard-csv
