Question
There are several ways to convert a pandas DataFrame to Parquet, e.g. pyarrow.Table.from_pandas or DataFrame.to_parquet. What they have in common is that they take a file path as a parameter, which is where the df.parquet file is stored.
I need to get the content of the written Parquet file into a variable and have not yet seen a way to do this. Mainly I want the same behavior as DataFrame.to_csv, which returns the result as a string if no path is provided.
Of course I could just write the file and read it back into a variable with Python's standard file reading operations. As I'm writing a ton of data, this would put a lot of load on the file system.
Answer 1:
You can use io.BytesIO for this, or alternatively Apache Arrow provides its own native implementation, BufferOutputStream. The benefit of the latter is that it writes to the stream without the overhead of going through Python, so fewer copies are made and the GIL is released.
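import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# Placeholder data; substitute your own pandas.DataFrame.
df = pd.DataFrame({"a": [1, 2, 3]})
table = pa.Table.from_pandas(df)
# Write into an in-memory Arrow stream instead of a file path.
buf = pa.BufferOutputStream()
pq.write_table(table, buf)
# buf now contains the Parquet file in memory.
# getvalue() returns a pyarrow.Buffer; to_pybytes() copies it into Python bytes.
parquet_bytes = buf.getvalue().to_pybytes()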
Source: https://stackoverflow.com/questions/54669196/pandas-to-parquet-not-into-file-system-but-get-content-of-resulting-file-in-vari