How do you append/update to a parquet
file with pyarrow
?
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
tabl
In your case the column name is not consistent, I made the column name consistent for three sample dataframes and the following code worked for me.
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def append_to_parquet_table(dataframe, filepath=None, writer=None):
"""Method writes/append dataframes in parquet format.
This method is used to write pandas DataFrame as pyarrow Table in parquet format. If the methods is invoked
with writer, it appends dataframe to the already written pyarrow table.
:param dataframe: pd.DataFrame to be written in parquet format.
:param filepath: target file location for parquet file.
:param writer: ParquetWriter object to write pyarrow tables in parquet format.
:return: ParquetWriter object. This can be passed in the subsequenct method calls to append DataFrame
in the pyarrow Table
"""
table = pa.Table.from_pandas(dataframe)
if writer is None:
writer = pq.ParquetWriter(filepath, table.schema)
writer.write_table(table=table)
return writer
if __name__ == '__main__':
table1 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
table3 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
writer = None
filepath = '/tmp/verify_pyarrow_append.parquet'
table_list = [table1, table2, table3]
for table in table_list:
writer = append_to_parquet_table(table, filepath, writer)
if writer:
writer.close()
df = pd.read_parquet(filepath)
print(df)
Output:
one three two
0 -1.0 True foo
1 NaN False bar
2 2.5 True baz
0 -1.0 True foo
1 NaN False bar
2 2.5 True baz
0 -1.0 True foo
1 NaN False bar
2 2.5 True baz