How to read partitioned parquet files from S3 using pyarrow in python

时光说笑 2020-12-07 21:03

I am looking for ways to read data from multiple partitioned directories in S3 using Python.

data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parq

5 answers
  •  离开以前
    2020-12-07 21:30

    For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between Pandas, S3, and Parquet.

    To install it, run:

    pip install awswrangler
    

    To read partitioned Parquet from S3 using awswrangler 1.x.x and above:

    import awswrangler as wr
    df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/", dataset=True)
    

    With dataset=True, awswrangler treats the path as a partitioned Parquet dataset: it reads every individual Parquet file in the partitions below the S3 prefix you pass as path.
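
    To make the partition layout concrete, here is a minimal standard-library sketch of how key=value directory names, as in the path from the question, map to column values. parse_hive_partitions is a hypothetical helper written for illustration only; it is not part of awswrangler or pyarrow. It returns every value as a string, whereas a real reader may infer types.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(key: str) -> dict:
    """Return {column: value} for every key=value directory in an S3 key.

    Hypothetical helper for illustration; not a library function.
    """
    partitions = {}
    # Skip the last path component: it is the file name, not a partition.
    for part in PurePosixPath(key).parts[:-1]:
        name, sep, value = part.partition("=")
        # Only directories of the form name=value count as partitions.
        if sep and name:
            partitions[name] = value
    return partitions


# The example key from the question.
example_key = "data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parq"
partitions = parse_hive_partitions(example_key)
print(partitions)
```

    Directories without an equals sign, such as data_folder, are treated as plain prefixes and ignored.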
