Create a zip file from a generator in Python?

佛祖请我去吃肉 2020-11-30 07:32

I've got a large amount of data (a couple gigs) I need to write to a zip file in Python. I can't load it all into memory at once to pass to the .writestr method of ZipFil

10 answers
  •  爱一瞬间的悲伤
    2020-11-30 08:34

    Changed in Python 3.5 (from official docs): Added support for writing to unseekable streams.

    This means zipfile.ZipFile can now write to streams that never hold the entire archive in memory. Such streams do not support seeking, so the archive can only be written sequentially, from start to end.
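    To see what changes when the target cannot seek, here is a minimal sketch. It relies on the ZIP format's data descriptor: when ZipFile cannot go back and patch an entry's header, it sets general-purpose flag bit 3 (0x08) and writes the CRC and sizes after the entry's data. The names WriteOnlySink and uses_data_descriptor are made up for this illustration.

```python
import io
import zipfile


class WriteOnlySink(io.RawIOBase):
    """Hypothetical write-only sink: it offers no seek() or tell() support."""

    def __init__(self):
        super().__init__()
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.data += b
        return len(b)


def uses_data_descriptor(target):
    """Write one small entry to target and report whether flag bit 3 is set."""
    with zipfile.ZipFile(target, mode='w') as zf:
        zf.writestr('a.txt', b'hello')
        return bool(zf.getinfo('a.txt').flag_bits & 0x08)


# A seekable target lets ZipFile patch the header, an unseekable one does not.
seekable_result = uses_data_descriptor(io.BytesIO())
unseekable_result = uses_data_descriptor(WriteOnlySink())
```

    With a seekable io.BytesIO the flag should stay clear, while the write-only sink should get it set.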

    Here is a simple generator:

    from zipfile import ZipFile, ZipInfo
    
    def zipfile_generator(path, stream):
        with ZipFile(stream, mode='w') as zf:
            z_info = ZipInfo.from_file(path)
            with open(path, 'rb') as entry, zf.open(z_info, mode='w') as dest:
                for chunk in iter(lambda: entry.read(16384), b''):
                    dest.write(chunk)
                    # Yield chunk of the zip file stream in bytes.
                    yield stream.get()
        # ZipFile was closed.
        yield stream.get()
    

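    path is the path to the large file, given as a string or a path-like object. (The generator as written calls open(path, 'rb'), so it handles a single file, not a directory.)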

    stream is an instance of an unseekable stream class like this one (designed according to the official docs):

    from io import RawIOBase
    
    class UnseekableStream(RawIOBase):
        def __init__(self):
            self._buffer = b''
    
        def writable(self):
            return True
    
        def write(self, b):
            if self.closed:
                raise ValueError('Stream was closed!')
            self._buffer += b
            return len(b)
    
        def get(self):
            chunk = self._buffer
            self._buffer = b''
            return chunk
    

    You can try this code online: https://repl.it/@IvanErgunov/zipfilegenerator
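
    The generator still has to be consumed somewhere. Below is a minimal sketch that writes each yielded piece to a file and then reads the archive back as a check. It assumes Python 3.6 or later (ZipInfo.from_file and ZipFile.open(..., mode='w') were added in 3.6). The stream class is restated only so that the block runs on its own. zip_many, chunk_size and the file names are hypothetical: zip_many is a multi-file variant of zipfile_generator, not part of the answer above.

```python
import os
import tempfile
from io import RawIOBase
from zipfile import ZipFile, ZipInfo


class UnseekableStream(RawIOBase):
    """Same idea as the class above, restated to keep this block standalone."""

    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        self._buffer += bytes(b)
        return len(b)

    def get(self):
        # Hand over everything buffered so far and start a fresh buffer.
        chunk, self._buffer = self._buffer, b''
        return chunk


def zip_many(paths, stream, chunk_size=16384):
    """Hypothetical variant of zipfile_generator that archives several files."""
    with ZipFile(stream, mode='w') as zf:
        for path in paths:
            z_info = ZipInfo.from_file(path, arcname=os.path.basename(path))
            with open(path, 'rb') as entry, zf.open(z_info, mode='w') as dest:
                for chunk in iter(lambda: entry.read(chunk_size), b''):
                    dest.write(chunk)
                    yield stream.get()
    # Closing ZipFile wrote the central directory; flush it as the last piece.
    yield stream.get()


with tempfile.TemporaryDirectory() as tmp:
    sources = []
    for i, name in enumerate(['a.bin', 'b.bin']):
        source = os.path.join(tmp, name)
        with open(source, 'wb') as f:
            f.write(bytes([65 + i]) * 50000)  # 50000 bytes of b'A', then b'B'
        sources.append(source)

    archive_path = os.path.join(tmp, 'out.zip')
    with open(archive_path, 'wb') as out:
        for piece in zip_many(sources, UnseekableStream()):
            out.write(piece)  # could equally be sent over a socket or HTTP response

    with ZipFile(archive_path) as zf:
        archived_names = zf.namelist()
        first_payload = zf.read('a.bin')
```

    Only one chunk of source data plus the small zip headers should be held in memory at any time, because every write is drained through get() right away.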


    There is also another way to build the generator, one that avoids using ZipInfo and reading the large file in chunks by hand. Pass a queue.Queue() object to your UnseekableStream() object and have its write() method put data on that queue while ZipFile does the writing in another thread. The current thread can then simply iterate over the chunks taken from the queue. See the docs.
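
    A rough sketch of that queue-based variant, under some assumptions: QueueStream, threaded_zip_generator, the _DONE sentinel and the maxsize default are names and choices made up for this illustration, and ZipFile.write() is assumed to stream the file through the unseekable target in small blocks (its behaviour since Python 3.6).

```python
import io
import os
import queue
import tempfile
import threading
import zipfile


class QueueStream(io.RawIOBase):
    """Hypothetical write-only stream that forwards every write to a queue."""

    def __init__(self, chunks):
        super().__init__()
        self._chunks = chunks

    def writable(self):
        return True

    def write(self, b):
        if self.closed:
            raise ValueError('Stream was closed!')
        data = bytes(b)
        if data:
            self._chunks.put(data)  # blocks while the bounded queue is full
        return len(data)


_DONE = object()  # sentinel marking the end of the archive


def threaded_zip_generator(path, arcname, maxsize=16):
    """Yield the zip archive of one file as ZipFile produces it in a thread."""
    chunks = queue.Queue(maxsize=maxsize)

    def worker():
        try:
            with zipfile.ZipFile(QueueStream(chunks), mode='w') as zf:
                zf.write(path, arcname=arcname)
        except BaseException as exc:
            chunks.put(exc)  # hand the failure over to the consuming thread
        else:
            chunks.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = chunks.get()
        if item is _DONE:
            return
        if isinstance(item, BaseException):
            raise item
        yield item


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'big.bin')
    payload = os.urandom(200000)
    with open(source, 'wb') as f:
        f.write(payload)
    # Joined here only to verify the result; real code would stream each piece.
    archive = b''.join(threaded_zip_generator(source, 'big.bin'))

with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    restored = zf.read('big.bin')
```

    The bounded queue keeps the writer thread from running far ahead of the reader. One caveat of this sketch: if the consumer stops iterating early, the worker stays blocked on put() until the process exits, which is why it is started as a daemon thread.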

    P.S. Python Zipstream by allanlei is outdated and unreliable. It was an attempt to add support for unseekable streams before Python supported them officially.
