How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

Backend · unresolved · 7 answers · 1751 views
小蘑菇
小蘑菇 asked 2020-12-04 09:15

I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3).

First, I can read a single parquet file …

7 Answers
  •  夕颜
    夕颜 (OP)
    2020-12-04 09:38

    Probably the easiest way to read parquet data on the cloud into dataframes is to use dask.dataframe in this way:

    import dask.dataframe as dd
    # Requires s3fs to be installed for the s3:// protocol.
    df = dd.read_parquet('s3://bucket/path/to/data-*.parq')
    

    dask.dataframe can read from Google Cloud Storage, Amazon S3, Hadoop file system and more!
