Streaming decompression of zip archives in python

有些话、适合烂在心里 提交于 2019-12-11 02:17:10

问题


Is there a way to do streaming decompression of single-file zip archives?

I currently have arbitrarily large zipped archives (single file per archive) in s3. I would like to be able to process the files by iterating over them without having to actually download the files to disk or into memory.

A simple example:

import boto

def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .zip file
    key = b.get_key(key_name)

    count = 0
    for chunk in key:
        # How should decompress happen?
        count += decompress(chunk).count('\n')

    return count

This answer demonstrates a method of doing the same thing with gzip'd files. Unfortunately, I haven't been able to get the same technique to work using the zipfile module, as it seems to require random access to the entire file being unzipped.


回答1:


You can use https://pypi.python.org/pypi/tubing, it even has built in s3 source support using boto3.

from tubing.ext import s3
from tubing import pipes, sinks
output = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | sinks.Objects()
print len(output)

If you didn't want to store the entire output in the returned sink, you could make your own sink that just counts. The impl would look like:

class CountWriter(object):
    def __init__(self):
        self.count = 0
    def write(self, chunk):
        self.count += len(chunk)
Counter = sinks.MakeSink(CountWriter)



回答2:


The zip header is at the end of the file, which is why it needs random access. See https://en.wikipedia.org/wiki/Zip_(file_format)#Structure.

You could parse the local file header which should be at the start of the file for a simple zip, and decompress the bytes with zlib (see zipfile.py). This is not a valid way to read a zip file, and while it might work for your specific scenario, it could also fail on a lot of valid zips. Reading the central directory file header is the only right way to read a zip.




回答3:


Yes, but you'll likely have to write your own code to do it if it has to be in Python. You can look at sunzip for an example in C for how to unzip a zip file from a stream. sunzip creates temporary files as it decompresses the zip entries, and then moves those files and sets their attributes appropriately upon reading the central directory at the end. Claims that you must be able to seek to the central directory in order to properly unzip a zip file are incorrect.




回答4:


You can do it in Python 3.4.3 using ZipFile as follows:

with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        print(myfile.read())

Python Docs



来源:https://stackoverflow.com/questions/29375201/streaming-decompression-of-zip-archives-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!