Question
I have an S3 bucket which has a large number of zip files, each several GB in size. I need to calculate the data length of all the zip files. I went through boto3 but couldn't figure it out. I am not sure whether it can read a zip file directly or not, but my process is:
- Connect with the bucket.
- Read the zip files from the bucket folder (let's say the folder is Mydata).
- Extract zip files to another folder named Extracteddata.
- Read the Extracteddata folder and act on the files.
Note: Nothing should be downloaded to local storage. The whole process should go from S3 to S3. Any suggestions are appreciated.
Answer 1:
What you want to do is impossible, as explained by John Rotenstein's answer. You have to download the zipfile—not necessarily to local storage, but at least to local memory, using up your local bandwidth. There's no way to run any code on S3.
However, there may be a way to get what you're really after here anyway.
If you could just download, say, 8KB worth of the file, instead of the whole 5GB, would that be good enough? If so—and if you're willing to do a bit of work—then you're in luck. What if you had to download, say, 1MB, but could do a lot less work?
If 1MB doesn't sound too bad, and you're willing to get a little hacky:
The only thing you want is a count of how many files are in the zipfile. For a zipfile, all of that information is available in the central directory, a very small chunk of data at the very end of the file.
And if you have the entire central directory, even if you're missing the rest of the file, the zipfile module in the stdlib will handle it just fine. It isn't documented to do so, but, at least in the versions included in recent CPython and PyPy 3.x, it definitely will.
So, what you can do is this:
- Make a HEAD request to get just the headers. (In boto3, you do this with head_object.)
- Extract the file size from the Content-Length header.
- Make a GET request with a Range header to only download from, say, size-1048576 to the end. (In boto3, I believe you may have to call get_object instead of one of the download* convenience methods, and you have to format the Range header value yourself.)
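In boto3 terms (the library named in the question), those steps might look roughly like the sketch below; the bucket and key names are placeholders, not anything from the question beyond the Mydata folder:

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "Mydata/archive.zip"  # placeholder names

# HEAD request: fetch only the metadata, including the object size.
size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

# Ranged GET: download just the last 1MB (or the whole object if it is smaller).
start = max(0, size - 1048576)
resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-")
buf = resp["Body"].read()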
Now, assuming you've got that last 1MB in a buffer buf:
import io
import zipfile
z = zipfile.ZipFile(io.BytesIO(buf))
count = len(z.filelist)
Usually, 1MB is more than enough. But what about when it isn't? Well, here's where things get a little hacky. The zipfile module knows how many more bytes you need, but the only place it gives you that information is in the text of the exception description. So:
import re

try:
    z = zipfile.ZipFile(io.BytesIO(buf))
except ValueError as e:
    m = re.match(r'negative seek value -(\d+)', e.args[0])
    if not m:
        raise
    extra = int(m.group(1))
    # now go read from size-1048576-extra to size-1048576, prepend to buf, try again
count = len(z.filelist)
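To make that concrete, here is one way the whole thing might be wrapped in a retry loop. This is only a sketch: count_zip_entries and fetch_range are names made up here (s3 is assumed to be a boto3 client), not boto3 or zipfile APIs.

import io
import re
import zipfile

def count_zip_entries(s3, bucket, key, guess=1048576):
    # Rough sketch: count the entries in a zip object without downloading all of it.
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    def fetch_range(start, end):
        # Helper (not a boto3 API): download bytes [start, end) of the object.
        resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end - 1}")
        return resp["Body"].read()

    start = max(0, size - guess)
    buf = fetch_range(start, size)
    while True:
        try:
            return len(zipfile.ZipFile(io.BytesIO(buf)).filelist)
        except ValueError as e:
            m = re.match(r'negative seek value -(\d+)', e.args[0])
            if not m or start == 0:
                raise
            # The exception text says how many more bytes of central directory
            # are missing; fetch them, prepend them to the buffer, and retry.
            extra = int(m.group(1))
            new_start = max(0, start - extra)
            buf = fetch_range(new_start, start) + buf
            start = new_start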
If 1MB already sounds like too much bandwidth, or you don't want to rely on undocumented behavior of the zipfile module, you just need to do a bit more work.
In almost every case, you don't even need the whole central directory, just the total number of entries field within the end of central directory record, an even smaller chunk of data at the very end of the file.
So, do the same as above, but only read the last 8KB instead of the last 1MB.
And then, based on the zip format spec, write your own parser.
Of course you don't need to write a complete parser, or even close to it. You just need enough to deal with the fields from total number of entries to the end, all of which are fixed-size fields except for the zip64 extensible data sector and/or the .ZIP file comment.
Occasionally (e.g., for zipfiles with huge comments), you will need to read more data to get the count. This should be pretty rare, but if, for some reason, it turns out to be more common with your zipfiles, you can just change that 8192 guess to something larger.
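As an illustration of that approach, a rough sketch of such a parser follows. It scans the tail bytes for the end of central directory signature (PK\x05\x06 in the zip spec) and unpacks the total number of entries field with struct; entry_count_from_tail is a made-up name, and zip64 archives are deliberately rejected rather than handled.

import struct

EOCD_SIGNATURE = b"PK\x05\x06"  # end of central directory record signature

def entry_count_from_tail(tail):
    # tail: the final bytes of the archive, e.g. the last 8192 bytes.
    # Naive search: a zip comment that happens to contain the signature
    # would confuse it (the stdlib's zipfile is more careful about this).
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos == -1:
        raise ValueError("EOCD record not found; read a larger tail")
    # Offset 10 within the record: "total number of entries in the
    # central directory", a 2-byte little-endian field.
    (count,) = struct.unpack_from("<H", tail, pos + 10)
    if count == 0xFFFF:
        raise ValueError("zip64 archive; the real count is in the zip64 EOCD record")
    return count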
Answer 2:
This is not possible.
You can upload files to Amazon S3 and you can download files. You can query the list of objects and obtain metadata about the objects. However, Amazon S3 does not provide compute, such as zip compression/decompression.
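If the compressed sizes of the zip objects themselves are all that is needed, that listing metadata already answers the question without downloading anything. A rough boto3 sketch, with a placeholder bucket name and the Mydata prefix from the question:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total = 0
for page in paginator.paginate(Bucket="my-bucket", Prefix="Mydata/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".zip"):
            total += obj["Size"]  # size of the zip object itself, in bytes

print(f"Total size of all zip objects: {total} bytes")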
You would need to write a program that:
- Downloads the zip file
- Extracts the files
- Does actions on the files
This is probably best done on an Amazon EC2 instance, which would have low-latency access to Amazon S3. You could do it with an AWS Lambda function, but it has a limit of 500MB disk storage and 5 minutes of execution, which doesn't seem applicable to your situation.
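For illustration, a minimal sketch of such a program, processing a single archive in memory rather than on local disk (the bucket name is a placeholder; archives that do not fit in memory would need disk or streaming instead):

import io
import zipfile

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # placeholder

# 1. Download the zip object (here into memory rather than onto disk).
body = s3.get_object(Bucket=bucket, Key="Mydata/archive.zip")["Body"].read()

# 2. Extract the files.
with zipfile.ZipFile(io.BytesIO(body)) as z:
    for info in z.infolist():
        if info.is_dir():
            continue
        data = z.read(info.filename)
        # 3. Do actions on the files; here they are simply written back to S3
        #    under the Extracteddata/ prefix mentioned in the question.
        s3.put_object(Bucket=bucket, Key=f"Extracteddata/{info.filename}", Body=data)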
If you are particularly clever, you might be able to download part of each zip file ('ranged get') and interpret the zipfile header to obtain a listing of the files and their sizes, thus avoiding having to download the whole file.
Source: https://stackoverflow.com/questions/51604689/read-zip-files-from-amazon-s3-using-boto3-and-python