Pandas to parquet NOT into file-system but get content of resulting file in variable

问题

There are several ways how a conversion from pandas to parquet is possible. e.g. pyarrow.Table.from_pandas or dataframe.to_parquet . What they have in common is that they get as a parameter a filePath where the df.parquet should be stored.

I need to get the content of the written parquet file into a variable and have not seen this, yet. Mainly I want the same behavior as pandas.to_csv which returns the result as a string if no path is provided.

Of course I could just write the file and read it with standard file reading operations from python into a string. As I'm writing a ton of data, this would produce a lot of load on the file system ... .

回答1:

You can either use io.BytesIO for this or alternatively Apache Arrow also provides its native implementation BufferOutputStream. The benefit of this is that this writes to the stream without the overhead of going through Python. Thus less copies are made and the GIL is released.

import pyarrow as pa
import pyarrow.parquet as pq

df = some pandas.DataFrame
table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
pq.write_table(table, buf)
# buf now contains the Parquet file in memory.

来源：https://stackoverflow.com/questions/54669196/pandas-to-parquet-not-into-file-system-but-get-content-of-resulting-file-in-vari

标签

python

pandas

parquet

pyarrow

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!