Question
I have a file called data.parquet.gzip in my S3 bucket, and I can't figure out what the problem is in reading it. I've normally worked with StringIO, but I don't know how to fix this here. I want to import the file from S3 into my Python Jupyter notebook session using pandas and boto3.
Answer 1:
The solution is actually quite straightforward.
import boto3            # S3 client for downloading the object
import pandas as pd     # For reading the parquet into a DataFrame
from io import BytesIO  # Parquet is binary, so wrap the raw bytes in BytesIO (not StringIO)
import pyarrow          # Parquet engine used by pd.read_parquet; it only needs to be installed

# Set up your S3 client.
# Ideally your Access Key and Secret Access Key are stored in a credentials
# file already, so you don't have to pass these parameters explicitly.
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_HERE,
                  aws_secret_access_key=SECRET_ACCESS_KEY_HERE)

# Fetch the object from the bucket
s3_response_object = s3.get_object(Bucket=BUCKET_NAME_HERE, Key=KEY_TO_GZIPPED_PARQUET_HERE)

# Read the streaming body into raw bytes
parquet_bytes = s3_response_object['Body'].read()

# Wrap the bytes in BytesIO and let pandas parse the parquet.
# The gzip compression is internal to the parquet columns, so no
# separate decompression step is needed.
df = pd.read_parquet(BytesIO(parquet_bytes))
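As a side note: if you have s3fs installed, pandas can read straight from an s3:// URL without an explicit boto3 call. A minimal sketch, assuming the bucket and key below are placeholders for your own (s3fs picks credentials up from your AWS config or environment):

import pandas as pd

# pandas delegates s3:// paths to s3fs, so the download and the
# BytesIO wrapping from the answer above happen behind the scenes.
df = pd.read_parquet('s3://BUCKET_NAME_HERE/KEY_TO_GZIPPED_PARQUET_HERE')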
Source: https://stackoverflow.com/questions/55732615/how-do-i-read-a-gzipped-parquet-file-from-s3-into-python-using-boto3