Using pyarrow, how do you append to a Parquet file?

眼角桃花 2021-01-30 17:19

How do you append/update to a parquet file with pyarrow?

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


 tabl         


        
3 Answers
  •  無奈伤痛
    2021-01-30 18:06

    In your case the column names are not consistent. I made the column names consistent across three sample DataFrames, and the following code worked for me.

    # -*- coding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    
    def append_to_parquet_table(dataframe, filepath=None, writer=None):
        """Write or append a DataFrame to a Parquet file.
    
        Converts a pandas DataFrame to a pyarrow Table and writes it in Parquet format. If the function is invoked
        with a writer, it appends the DataFrame to the file that writer already has open.
    
        :param dataframe: pd.DataFrame to be written in Parquet format.
        :param filepath: target file location for the Parquet file; used only when no writer is passed.
        :param writer: ParquetWriter object used to write pyarrow Tables in Parquet format.
        :return: ParquetWriter object. Pass it to subsequent calls to append more DataFrames
            to the same file.
        """
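        table = pa.Table.from_pandas(dataframe)
        if writer is None:
            # First call: open the file and fix its schema from this table.
            writer = pq.ParquetWriter(filepath, table.schema)
        # Each call adds the table as new row group(s); the caller must close the writer.
        writer.write_table(table=table)
        return writer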
    
    
    if __name__ == '__main__':
    
        table1 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
        table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
        table3 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
        writer = None
        filepath = '/tmp/verify_pyarrow_append.parquet'
        table_list = [table1, table2, table3]
    
        for table in table_list:
            writer = append_to_parquet_table(table, filepath, writer)
    
        if writer:
            writer.close()
    
        df = pd.read_parquet(filepath)
        print(df)
    

    Output:

       one  three  two
    0 -1.0   True  foo
    1  NaN  False  bar
    2  2.5   True  baz
    0 -1.0   True  foo
    1  NaN  False  bar
    2  2.5   True  baz
    0 -1.0   True  foo
    1  NaN  False  bar
    2  2.5   True  baz
    
