How to read partitioned parquet files from S3 using pyarrow in python

时光说笑 2020-12-07 21:03

I am looking for ways to read data from multiple partitioned directories in S3 using Python. The files are laid out like this:

data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parq

5 answers

离开以前 (original poster)

2020-12-07 21:30
For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between Pandas, S3, and Parquet.

To install it, run:
```
pip install awswrangler
```
To read partitioned Parquet from S3 using awswrangler 1.x.x and above:
```
import awswrangler as wr
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/", dataset=True)
```
Setting dataset=True tells awswrangler to expect partitioned Parquet files. It reads every individual Parquet file in the partition directories below the S3 key you pass as path.
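
To make "partitioned" concrete, here is a minimal standard-library sketch of how the key=value directory names in a path like the one in the question can be decoded into partition columns. The helper name `parse_partitions` is hypothetical and is not part of awswrangler or pyarrow; it only illustrates the directory convention, and the values are left as plain strings with no type inference.

```python
def parse_partitions(key: str) -> dict:
    """Return the key=value directory segments of an object key as a dict.

    Hypothetical helper for illustration only; not a library API.
    """
    partitions = {}
    # Skip the last segment: it is the file name, not a partition directory.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        # Only segments of the form name=value count as partitions.
        if sep and name:
            partitions[name] = value
    return partitions


# Path taken from the question.
key = "data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parq"
print(parse_partitions(key))
# → {'serial_number': '1', 'cur_date': '20-12-2012'}
```

In practice you would not parse the keys yourself. The sketch is only meant to show why pointing the reader at `data_folder/` is enough to pick up every `serial_number` and `cur_date` directory beneath it.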