Reading the contents of a gzip file from AWS S3 in Python

Posted by 无人久伴 on 2020-06-24 05:04:09

Question


I am trying to read some logs from a Hadoop process that I run in AWS. The logs are stored in an S3 folder and have the following path.

bucketname = name
key = y/z/stderr.gz

Here y is the cluster id and z is a folder name. Both of these act as folders (objects) in AWS, so the full path is like x/y/z/stderr.gz.

Now I want to unzip this .gz file and read the contents of the file. I don't want to download the file to my system; I want to save its contents in a Python variable.

This is what I have tried so far.

import boto3

s3 = boto3.resource('s3')
bucket_name = "name"
key = "y/z/stderr.gz"
obj = s3.Object(bucket_name, key)
n = obj.get()['Body'].read()

This gives me output that is not readable (the raw compressed bytes). I also tried

n = obj.get()['Body'].read().decode('utf-8')

which gives an error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. (0x8b is the second byte of the gzip magic number \x1f\x8b, which is why a UTF-8 decode fails on the raw compressed bytes.)

I have also tried

gzip = StringIO(obj)
gzipfile = gzip.GzipFile(fileobj=gzip)
content = gzipfile.read()

This returns an error IOError: Not a gzipped file

Not sure how to decode this .gz file.

Edit: found a solution. I needed to pass n into it and use BytesIO:

gzip = BytesIO(n)
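
Putting it together, a minimal working sketch of the fix, using the same bucket_name and key as above (the buffer gets its own name here, since gzip refers to the standard-library module):

import gzip
import boto3
from io import BytesIO

s3 = boto3.resource('s3')
obj = s3.Object("name", "y/z/stderr.gz")
n = obj.get()['Body'].read()            # raw gzip bytes

gzipfile = gzip.GzipFile(fileobj=BytesIO(n))
content = gzipfile.read()               # decompressed bytes
text = content.decode('utf-8')          # the log contents as a string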

Answer 1:


@Amit, I was trying to do the same thing to test decoding a file, and got your code to run with some modifications. I just had to remove the function def and the return, and rename the gzip variable, since that name is taken by the imported module.

import boto3
from io import BytesIO
import gzip

try:
    s3 = boto3.resource('s3')
    key = 'YOUR_FILE_NAME.gz'
    obj = s3.Object('YOUR_BUCKET_NAME', key)
    n = obj.get()['Body'].read()          # raw gzip bytes
    gzipfile = BytesIO(n)                 # wrap the bytes in a file-like buffer
    gzipfile = gzip.GzipFile(fileobj=gzipfile)
    content = gzipfile.read()             # decompressed bytes
    print(content)
except Exception as e:
    print(e)
    raise e



Answer 2:


You can use Amazon S3 Select (SelectObjectContent) to read gzip contents.

S3 Select is an Amazon S3 capability designed to pull out only the data you need from an object, which can dramatically improve the performance and reduce the cost of applications that need to access data in S3.

Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only).

Ref: https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html

from io import StringIO
import boto3
import pandas as pd

bucket = 'my-bucket'
prefix = 'my-prefix'

client = boto3.client('s3')

for obj in client.list_objects_v2(Bucket=bucket, Prefix=prefix)['Contents']:
    if obj['Size'] <= 0:
        continue

    print(obj['Key'])
    r = client.select_object_content(
        Bucket=bucket,
        Key=obj['Key'],
        ExpressionType='SQL',
        Expression="select * from s3object",
        InputSerialization={'CompressionType': 'GZIP', 'JSON': {'Type': 'DOCUMENT'}},
        OutputSerialization={'CSV': {'QuoteFields': 'ASNEEDED', 'RecordDelimiter': '\n', 'FieldDelimiter': ',', 'QuoteCharacter': '"', 'QuoteEscapeCharacter': '"'}},
    )

    # The response payload is an event stream; 'Records' events carry the data.
    for event in r['Payload']:
        if 'Records' in event:
            payload = event['Records']['Payload'].decode('utf-8')
            try:
                # error_bad_lines is deprecated in newer pandas; use on_bad_lines='skip' there
                select_df = pd.read_csv(StringIO(payload), error_bad_lines=False)
                for row in select_df.iterrows():
                    print(row)
            except Exception as e:
                print(e)
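
One caveat worth noting: a single CSV row can be split across two Records events, so it is safer to accumulate the whole payload before parsing. A small sketch of that variant, reusing r, pd, and StringIO from the snippet above:

chunks = []
for event in r['Payload']:
    if 'Records' in event:
        chunks.append(event['Records']['Payload'])
payload = b''.join(chunks).decode('utf-8')
select_df = pd.read_csv(StringIO(payload))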



Answer 3:


This is old, but you no longer need the BytesIO object in the middle of it (at least on my boto3==1.9.223 and Python 3.7). gzip.GzipFile accepts any file-like object with a read() method, and the StreamingBody returned by obj.get()["Body"] qualifies:

import boto3
import gzip

s3 = boto3.resource("s3")
obj = s3.Object("YOUR_BUCKET_NAME", "path/to/your_key.gz")
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    content = gzipfile.read()
print(content)
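
If the log is large, the same stream can also be consumed line by line instead of being read all at once; a sketch, assuming the same placeholder bucket and key:

import io
import gzip
import boto3

s3 = boto3.resource("s3")
obj = s3.Object("YOUR_BUCKET_NAME", "path/to/your_key.gz")
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    # GzipFile is a buffered binary stream, so TextIOWrapper can decode it lazily
    for line in io.TextIOWrapper(gzipfile, encoding="utf-8"):
        print(line.rstrip("\n"))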



Answer 4:


Reading a .bz2 file from AWS S3 in Python:

import boto3
import bz2

try:
    s3 = boto3.resource('s3')
    key = 'key_name.bz2'
    obj = s3.Object('bucket_name', key)
    nn = obj.get()['Body'].read()          # raw bz2 bytes
    content = bz2.decompress(nn)           # decompressed bytes
    lines = content.split(b'\n')           # bytes need a bytes separator in Python 3
    print(len(lines))
except Exception as e:
    print(e)



Answer 5:


Just as with variables, data can be kept as bytes in an in-memory buffer using the io module's BytesIO class.

Here is a sample program to demonstrate this:

import io

stream = io.BytesIO(b"JournalDev Python: \x00\x01")
print(stream.getvalue())

The getvalue() method returns the entire contents of the buffer as a bytes object.

So the answer from @Jean-FrançoisFabre is correct, and you should use

gzip = BytesIO(n)
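
A self-contained sketch (no S3 needed) of why this works, compressing some bytes in memory and reading them back through BytesIO; the buffer is given its own name here so it does not shadow the gzip module:

import gzip
from io import BytesIO

# Simulate the compressed bytes returned by obj.get()['Body'].read()
raw = gzip.compress(b"line 1\nline 2\n")

buf = BytesIO(raw)                       # a seekable, file-like view of the bytes
with gzip.GzipFile(fileobj=buf) as gz:
    print(gz.read().decode('utf-8'))     # prints the original text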

For more information, read the following doc:

https://docs.python.org/3/library/io.html




Answer 6:


Currently the file can be read directly with pandas (this requires the s3fs package, which pandas uses to handle s3:// URLs):

import pandas as pd

bucket = 'bucket name'
data_key = 'data key'
data_location = 's3://{}/{}'.format(bucket, data_key)
data = pd.read_csv(data_location, compression='gzip', header=0, sep=',', quotechar='"')


Source: https://stackoverflow.com/questions/41161006/reading-contents-of-a-gzip-file-from-a-aws-s3-in-python
