How can I convert a PySpark dataframe to a CSV without sending it to a file?

Question

I have a dataframe which I need to convert to a CSV file, and then I need to send this CSV to an API. As I'm sending it to an API, I do not want to save it to the local filesystem and need to keep it in memory. How can I do this?

Answer 1:

Easy way: convert your dataframe to a Pandas dataframe with toPandas(), then save it to a string. To get a string rather than a file, call to_csv with path_or_buf=None. Then send the string in an API call.

From to_csv() documentation:

Parameters

path_or_buf : str or file handle, default None

File path or object, if None is provided the result is returned as a string.

So your code would likely look like this:

csv_string = df.toPandas().to_csv(path_or_buf=None)
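
As a rough sketch of the full round trip, assuming the API accepts a raw CSV body over HTTP POST: the helper names and the URL below are hypothetical, and index=False is an optional addition that leaves the pandas row index out of the output. Keep in mind that toPandas() collects the whole dataframe onto the driver, so this only suits data that fits in driver memory.

```python
import urllib.request


def spark_df_to_csv_bytes(df, encoding="utf-8"):
    """Return the dataframe's contents as CSV bytes without touching disk.

    `df` is expected to be a PySpark DataFrame; toPandas() pulls all rows
    onto the driver, so the data must fit in driver memory.
    """
    # path_or_buf=None makes pandas return the CSV as a string.
    # index=False (an optional choice here) omits the pandas row index.
    csv_string = df.toPandas().to_csv(path_or_buf=None, index=False)
    return csv_string.encode(encoding)


def build_csv_upload_request(url, csv_bytes):
    """Build (but do not send) a POST request carrying the CSV as its body."""
    return urllib.request.Request(
        url,
        data=csv_bytes,
        headers={"Content-Type": "text/csv"},
        method="POST",
    )


# Hypothetical usage; the endpoint is a placeholder, not a real API:
# payload = spark_df_to_csv_bytes(df)
# request = build_csv_upload_request("https://api.example.com/upload", payload)
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Whether the API expects a raw body, a multipart upload, or JSON-wrapped text depends on the service, so the request-building step is the part most likely to need changing.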

Alternatives: use tempfile.SpooledTemporaryFile with a large max_size to get a file-like object that stays in memory until it grows past that size. Or you can even use a regular file: just make its buffer large enough and don't flush or close the file. Take a look at Corey Goldberg's explanation of why this works.
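
A minimal sketch of the SpooledTemporaryFile route, using only the standard library. The function name, the sample rows, and the 10 MB limit are illustrative assumptions; with PySpark, the header could come from df.columns and the rows from df.collect(), which likewise brings all rows to the driver.

```python
import csv
import tempfile


def rows_to_spooled_csv(header, rows, max_size=10 * 1024 * 1024):
    """Write rows as CSV into a spooled file that stays in memory
    until its size exceeds max_size bytes."""
    # mode="w+" opens the file for text reading and writing;
    # newline="" is what the csv module recommends for file objects.
    spooled = tempfile.SpooledTemporaryFile(
        max_size=max_size, mode="w+", newline=""
    )
    writer = csv.writer(spooled)
    writer.writerow(header)
    writer.writerows(rows)
    spooled.seek(0)  # rewind so the caller can read from the start
    return spooled


# Illustrative data standing in for df.columns and df.collect().
with rows_to_spooled_csv(["id", "name"], [(1, "alice"), (2, "bob")]) as f:
    csv_text = f.read()
```

The returned object is file-like, so it can be read into a string as above or handed to an HTTP client that accepts file objects for the request body.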

Source: https://stackoverflow.com/questions/61645936/how-can-i-convert-a-pyspark-dataframe-to-a-csv-without-sending-it-to-a-file

Tags

apache-spark

pyspark
