Question
Is there a way to do streaming decompression of single-file zip archives?
I currently have arbitrarily large zipped archives (single file per archive) in S3. I would like to process the files by iterating over them, without downloading them to disk or loading them fully into memory.
A simple example:
import boto

def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .zip file
    key = b.get_key(key_name)
    count = 0
    for chunk in key:
        # How should decompress happen?
        count += decompress(chunk).count('\n')
    return count
This answer demonstrates a method of doing the same thing with gzip'd files. Unfortunately, I haven't been able to get the same technique to work using the zipfile module, as it seems to require random access to the entire file being unzipped.
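For context, the linked gzip answer is not reproduced here, but a common way to decompress gzip data chunk by chunk is zlib.decompressobj with wbits set to 16 + zlib.MAX_WBITS. The sketch below assumes `chunks` is any iterable of byte strings (for example, the chunks yielded by an S3 key); the function names are made up for illustration.

```python
import zlib


def iter_gunzip(chunks):
    """Yield decompressed pieces from an iterable of gzip-compressed byte chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        piece = inflater.decompress(chunk)
        if piece:
            yield piece
    tail = inflater.flush()
    if tail:
        yield tail


def count_newlines_gzip(chunks):
    """Count newline bytes without holding the whole payload in memory."""
    return sum(piece.count(b"\n") for piece in iter_gunzip(chunks))
```

This works for gzip because the format is a simple header followed by one deflate stream; a zip archive wraps its deflate data differently, which is what the question is about.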
Answer 1:
You can use https://pypi.python.org/pypi/tubing. It even has built-in S3 source support using boto3.
from tubing.ext import s3
from tubing import pipes, sinks

output = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | sinks.Objects()
print(len(output))
If you didn't want to store the entire output in the returned sink, you could write your own sink that just counts. The implementation would look like:
class CountWriter(object):
    def __init__(self):
        self.count = 0

    def write(self, chunk):
        self.count += len(chunk)

Counter = sinks.MakeSink(CountWriter)
Answer 2:
The zip central directory, which is the authoritative index of the archive, is at the end of the file, which is why zipfile needs random access. See https://en.wikipedia.org/wiki/Zip_(file_format)#Structure.
You could parse the local file header, which should be at the start of the file for a simple zip, and decompress the bytes with zlib (see zipfile.py). This is not a valid way to read a zip file, and while it might work for your specific scenario, it could also fail on many valid zips. Reading the central directory file header is the only right way to read a zip.
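A minimal sketch of the local-header approach described above, with the same caveats: it trusts the local file header, handles only a deflate-compressed first entry, and never consults the central directory. The function names are hypothetical, and `chunks` is assumed to be any iterable of byte strings.

```python
import struct
import zlib

# Fixed 30-byte local file header: signature, version, flags, method,
# mod time, mod date, CRC-32, compressed size, uncompressed size,
# file name length, extra field length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def _read_at_least(buf, chunks, n):
    """Pull chunks from the iterator until buf holds at least n bytes."""
    while len(buf) < n:
        chunk = next(chunks, None)
        if chunk is None:
            raise ValueError("stream ended inside the local file header")
        buf += chunk
    return buf


def iter_first_zip_member(chunks):
    """Yield decompressed pieces of the first entry of a zip byte stream."""
    chunks = iter(chunks)
    buf = _read_at_least(b"", chunks, LOCAL_HEADER.size)
    fields = LOCAL_HEADER.unpack(buf[:LOCAL_HEADER.size])
    signature, method = fields[0], fields[3]
    name_len, extra_len = fields[9], fields[10]
    if signature != b"PK\x03\x04":
        raise ValueError("stream does not start with a local file header")
    if method != 8:
        raise NotImplementedError("only deflate (method 8) is handled here")

    # Skip the variable-length file name and extra field.
    data_start = LOCAL_HEADER.size + name_len + extra_len
    buf = _read_at_least(buf, chunks, data_start)

    # Negative wbits selects a raw deflate stream (no zlib or gzip wrapper).
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)
    pending = buf[data_start:]
    while not inflater.eof:
        if pending is None:
            raise ValueError("stream ended before the deflate data was complete")
        piece = inflater.decompress(pending)
        if piece:
            yield piece
        if not inflater.eof:
            pending = next(chunks, None)


def count_newlines_zip(chunks):
    """Count newline bytes in the first entry of a streamed zip archive."""
    return sum(piece.count(b"\n") for piece in iter_first_zip_member(chunks))
```

The loop stops as soon as zlib reports the end of the deflate stream, so the compressed size in the header (which may be zero when a data descriptor is used) is never needed. Nothing here verifies the CRC-32.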
Answer 3:
Yes, but you'll likely have to write your own code to do it if it has to be in Python. You can look at sunzip for an example in C of how to unzip a zip file from a stream. sunzip creates temporary files as it decompresses the zip entries, and then moves those files and sets their attributes appropriately upon reading the central directory at the end. Claims that you must be able to seek to the central directory in order to properly unzip a zip file are incorrect.
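The idea can be sketched in Python for the single-entry case. This is not a port of sunzip, only an illustration under several assumptions: one deflate-compressed entry, no Zip64, and a central directory small enough to buffer. It inflates into a temporary file as the stream arrives, then compares the CRC-32 and size with the central directory record that follows. The function name is hypothetical.

```python
import struct
import tempfile
import zlib


def unzip_stream_to_tempfile(chunks):
    """Inflate the first entry into a temp file, then check it against
    the central directory record that follows it in the stream."""
    chunks = iter(chunks)
    buf, need = b"", 30  # the fixed part of a local file header is 30 bytes
    while len(buf) < need:
        chunk = next(chunks, None)
        if chunk is None:
            raise ValueError("stream ended inside the local file header")
        buf += chunk
        if len(buf) >= 30:
            if buf[:4] != b"PK\x03\x04":
                raise ValueError("stream does not start with a local file header")
            # File name length and extra field length sit at offsets 26 and 28.
            name_len, extra_len = struct.unpack("<HH", buf[26:30])
            need = 30 + name_len + extra_len
    (method,) = struct.unpack("<H", buf[8:10])
    if method != 8:
        raise NotImplementedError("only deflate (method 8) is handled here")

    inflater = zlib.decompressobj(-zlib.MAX_WBITS)  # raw deflate stream
    out = tempfile.TemporaryFile()
    crc, size, pending = 0, 0, buf[need:]
    while not inflater.eof:
        if pending is None:
            raise ValueError("stream ended before the deflate data was complete")
        piece = inflater.decompress(pending)
        out.write(piece)
        crc = zlib.crc32(piece, crc)
        size += len(piece)
        if not inflater.eof:
            pending = next(chunks, None)

    # Whatever follows the deflate data: optional data descriptor, then
    # the central directory. Simplification: locate it by its signature.
    trailer = inflater.unused_data + b"".join(chunks)
    pos = trailer.find(b"PK\x01\x02")
    if pos < 0:
        raise ValueError("central directory record not found")
    # CRC-32, compressed size, uncompressed size start at offset 16.
    want_crc, _csize, want_size = struct.unpack("<III", trailer[pos + 16:pos + 28])
    if (crc & 0xFFFFFFFF, size) != (want_crc, want_size):
        out.close()
        raise ValueError("entry does not match its central directory record")
    out.seek(0)
    return out
```

The temporary file is what lets the result be withheld until the central directory has confirmed it, at the cost of writing the decompressed data to disk.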
Answer 4:
You can do it in Python 3.4.3 using ZipFile as follows:
from zipfile import ZipFile

with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        print(myfile.read())
(See the Python docs for the zipfile module.)
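Note that ZipFile needs a path or a seekable file object, so this approach applies once the archive is available locally or in a seekable buffer, not to a raw network stream. Within that limit, the member itself can still be read in pieces instead of with a single read(). A small sketch, with a hypothetical function name and chunk size:

```python
import zipfile


def count_newlines_seekable(fileobj, chunk_size=64 * 1024):
    """Count newlines in the first member of a zip held in a seekable file object."""
    count = 0
    with zipfile.ZipFile(fileobj) as archive:
        first = archive.infolist()[0]  # single-file archive assumed
        with archive.open(first) as member:
            # Read the decompressed data in bounded pieces.
            for chunk in iter(lambda: member.read(chunk_size), b""):
                count += chunk.count(b"\n")
    return count
```

This keeps the decompressed data out of memory, but the compressed archive still has to be fully available before reading starts.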
Source: https://stackoverflow.com/questions/29375201/streaming-decompression-of-zip-archives-in-python