How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

小蘑菇 2020-12-04 09:15

I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3).

First, I can read a single parq

7 answers

离开以前 (OP)

2020-12-04 09:31
You should use the s3fs module, as proposed by yjk21. However, calling ParquetDataset returns a pyarrow.parquet.ParquetDataset object, not a DataFrame. To get a pandas DataFrame, call .read_pandas().to_pandas() on it:
```
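import pyarrow.parquet as pq
import s3fs

# Credentials are picked up from the environment or AWS config
s3 = s3fs.S3FileSystem()

# Read every Parquet file under the bucket, then convert to pandas
pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()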
```