I am looking for ways to read data from multiple partitioned directories in S3 using Python.
data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parq
For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between Pandas/S3/Parquet.
To install it, run:
pip install awswrangler
To read partitioned Parquet from S3 using awswrangler 1.x.x and above, do:
import awswrangler as wr
df = wr.s3.read_parquet(path="s3://my_bucket/path/to/data_folder/", dataset=True)
By setting dataset=True, awswrangler expects partitioned Parquet files. It will read all the individual Parquet files from your partitions below the S3 key you specify in the path.
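To make the partition layout concrete, here is a minimal sketch of how the key=value folder names in a key like the one in the question map to partition values. As far as I understand, with dataset=True those values show up as extra columns in the resulting DataFrame. The helper name parse_partitions is hypothetical and is not part of awswrangler; it only illustrates the naming convention, not how the library does it internally.

```python
def parse_partitions(key):
    """Return the Hive-style partition values encoded in an S3 key.

    Hypothetical helper for illustration only; not an awswrangler function.
    """
    partitions = {}
    # Every path segment except the last one (the file name) may be a partition.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        # Only segments of the form name=value count as partitions.
        if sep and name:
            partitions[name] = value
    return partitions


key = "data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parq"
print(parse_partitions(key))
# prints {'serial_number': '1', 'cur_date': '20-12-2012'}
```

So the path you pass should be the folder above the first key=value directory (data_folder/ here), not an individual partition folder, unless you only want that one partition.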