Create a zip file from a generator in Python?

Asked 2020-11-30 07:32

I've got a large amount of data (a couple gigs) I need to write to a zip file in Python. I can't load it all into memory at once to pass to the .writestr method of ZipFile.

10 Answers
  • 2020-11-30 08:27

    As of Python 2.7 you can pass data directly to ZipFile.writestr instead of writing it to a file first:

    http://docs.python.org/2/library/zipfile#zipfile.ZipFile.writestr
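    A minimal sketch of that call (the file and member names here are illustrative):

```python
import zipfile

# Write a bytes payload straight into the archive -- no temporary
# file on disk is needed, but the payload itself must fit in memory.
with zipfile.ZipFile('example.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('hello.txt', b'hello world')
```

    Note that this still requires each member's full data in memory, which is exactly the limitation the question is about.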

  • 2020-11-30 08:33

    The essential compression is done by zlib.compressobj. (ZipFile under Python 2.5 on Mac OS X appears to be compiled, so its source isn't handy; the Python 2.3 version of ZipFile.write is shown below.)

    You can see that it builds the compressed file in 8 KB chunks. Tearing out the source-file handling is complicated, because many source-file attributes (like the uncompressed size) are recorded in the zip file header.

    def write(self, filename, arcname=None, compress_type=None):
        """Put the bytes from filename into the archive under the name
        arcname."""
    
        st = os.stat(filename)
        mtime = time.localtime(st.st_mtime)
        date_time = mtime[0:6]
        # Create ZipInfo instance to store file information
        if arcname is None:
            zinfo = ZipInfo(filename, date_time)
        else:
            zinfo = ZipInfo(arcname, date_time)
        zinfo.external_attr = st[0] << 16L      # Unix attributes
        if compress_type is None:
            zinfo.compress_type = self.compression
        else:
            zinfo.compress_type = compress_type
        self._writecheck(zinfo)
        fp = open(filename, "rb")
    
        zinfo.flag_bits = 0x00
        zinfo.header_offset = self.fp.tell()    # Start of header bytes
        # Must overwrite CRC and sizes with correct data later
        zinfo.CRC = CRC = 0
        zinfo.compress_size = compress_size = 0
        zinfo.file_size = file_size = 0
        self.fp.write(zinfo.FileHeader())
        zinfo.file_offset = self.fp.tell()      # Start of file bytes
        if zinfo.compress_type == ZIP_DEFLATED:
            cmpr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION,
                 zlib.DEFLATED, -15)
        else:
            cmpr = None
        while 1:
            buf = fp.read(1024 * 8)
            if not buf:
                break
            file_size = file_size + len(buf)
            CRC = binascii.crc32(buf, CRC)
            if cmpr:
                buf = cmpr.compress(buf)
                compress_size = compress_size + len(buf)
            self.fp.write(buf)
        fp.close()
        if cmpr:
            buf = cmpr.flush()
            compress_size = compress_size + len(buf)
            self.fp.write(buf)
            zinfo.compress_size = compress_size
        else:
            zinfo.compress_size = file_size
        zinfo.CRC = CRC
        zinfo.file_size = file_size
        # Seek backwards and write CRC and file sizes
        position = self.fp.tell()       # Preserve current position in file
        self.fp.seek(zinfo.header_offset + 14, 0)
        self.fp.write(struct.pack("<lLL", zinfo.CRC, zinfo.compress_size,
              zinfo.file_size))
        self.fp.seek(position, 0)
        self.filelist.append(zinfo)
        self.NameToInfo[zinfo.filename] = zinfo
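
    The chunking-plus-compressobj pattern in the loop above can be exercised on its own; wbits=-15 produces the raw DEFLATE stream (no zlib header or trailer) that the zip format stores:

```python
import zlib

def deflate_chunks(chunks):
    # Compress an iterable of byte chunks into raw DEFLATE output,
    # mirroring the 8 KB read loop in ZipFile.write above.
    cmpr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    for buf in chunks:
        out = cmpr.compress(buf)
        if out:
            yield out
    tail = cmpr.flush()
    if tail:
        yield tail

compressed = b''.join(deflate_chunks([b'a' * 8192, b'b' * 8192]))
```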
    
  • 2020-11-30 08:33

    In case anyone stumbles upon this question, which is still relevant in 2017 for Python 2.7, here's a working solution for a true streaming zip file, with no requirement for the output to be seekable as in the other cases. The secret is to set bit 3 of the general purpose bit flag (see https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT section 4.3.9.1).

    Note that this implementation will always create a ZIP64-style file, allowing the streaming to work for arbitrarily large files. It includes an ugly hack to force the zip64 end of central directory record, so be aware it will cause all zipfiles written by your process to become ZIP64-style.

    import io
    import time
    import zipfile
    import zlib
    import binascii
    import struct
    
    class ByteStreamer(io.BytesIO):
        '''
        Variant on BytesIO which lets you write and consume data while
        keeping track of the total filesize written. When data is consumed
        it is removed from memory, keeping the memory requirements low.
        '''
        def __init__(self):
            super(ByteStreamer, self).__init__()
            self._tellall = 0
    
        def tell(self):
            return self._tellall
    
        def write(self, b):
            orig_size = super(ByteStreamer, self).tell()
            super(ByteStreamer, self).write(b)
            new_size = super(ByteStreamer, self).tell()
            self._tellall += (new_size - orig_size)
            return new_size - orig_size  # io convention: report bytes written
    
        def consume(self):
            data = self.getvalue()  # avoid shadowing the built-in 'bytes'
            self.seek(0)
            self.truncate(0)
            return data
    
    class BufferedZipFileWriter(zipfile.ZipFile):
        '''
        ZipFile writer with true streaming (input and output).
        Created zip files are always ZIP64-style because it is the only safe way to stream
        potentially large zip files without knowing the full size ahead of time.
    
        Example usage:
        >>> def stream():
        >>> bzfw = BufferedZipFileWriter()
        >>>     for arc_path, buffer in inputs:  # buffer is a file-like object which supports read(size)
        >>>         for chunk in bzfw.streambuffer(arc_path, buffer):
        >>>             yield chunk
        >>>     yield bzfw.close()
        '''
        def __init__(self, compression=zipfile.ZIP_DEFLATED):
            self._buffer = ByteStreamer()
            super(BufferedZipFileWriter, self).__init__(self._buffer, mode='w', compression=compression, allowZip64=True)
    
        def streambuffer(self, zinfo_or_arcname, buffer, chunksize=2**16):
            if not isinstance(zinfo_or_arcname, zipfile.ZipInfo):
                zinfo = zipfile.ZipInfo(filename=zinfo_or_arcname,
                                        date_time=time.localtime(time.time())[:6])
                zinfo.compress_type = self.compression
                zinfo.external_attr = 0o600 << 16     # -rw-------
            else:
                zinfo = zinfo_or_arcname
    
            zinfo.file_size = file_size = 0
            zinfo.flag_bits = 0x08  # Streaming mode: crc and size come after the data
            zinfo.header_offset = self.fp.tell()
    
            self._writecheck(zinfo)
            self._didModify = True
    
            zinfo.CRC = CRC = 0
            zinfo.compress_size = compress_size = 0
            self.fp.write(zinfo.FileHeader())
            if zinfo.compress_type == zipfile.ZIP_DEFLATED:
                cmpr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
            else:
                cmpr = None
    
            while True:
                buf = buffer.read(chunksize)
                if not buf:
                    break
    
                file_size += len(buf)
                CRC = binascii.crc32(buf, CRC) & 0xffffffff
                if cmpr:
                    buf = cmpr.compress(buf)
                    compress_size += len(buf)
    
                self.fp.write(buf)
                compressed_bytes = self._buffer.consume()
                if compressed_bytes:
                    yield compressed_bytes
    
            if cmpr:
                buf = cmpr.flush()
                compress_size += len(buf)
                self.fp.write(buf)
                zinfo.compress_size = compress_size
                compressed_bytes = self._buffer.consume()
                if compressed_bytes:
                    yield compressed_bytes
            else:
                zinfo.compress_size = file_size
    
            zinfo.CRC = CRC
            zinfo.file_size = file_size
    
            # Write CRC and file sizes after the file data
            # Always write as zip64 -- only safe way to stream what might become a large zipfile
            fmt = '<LQQ'
            self.fp.write(struct.pack(fmt, zinfo.CRC, zinfo.compress_size, zinfo.file_size))
    
            self.fp.flush()
            self.filelist.append(zinfo)
            self.NameToInfo[zinfo.filename] = zinfo
            yield self._buffer.consume()
    
        # The close method needs to be patched to force writing a ZIP64 file
        # We'll hack ZIP_FILECOUNT_LIMIT to do the forcing
        def close(self):
            tmp = zipfile.ZIP_FILECOUNT_LIMIT
            zipfile.ZIP_FILECOUNT_LIMIT = 0
            super(BufferedZipFileWriter, self).close()
            zipfile.ZIP_FILECOUNT_LIMIT = tmp
            return self._buffer.consume()
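
    The '<LQQ' data descriptor written after each member above (4-byte CRC, then 8-byte compressed and uncompressed sizes, as ZIP64 requires) can be checked in isolation; the values here are arbitrary:

```python
import struct

# ZIP64 data descriptor without the optional 0x08074b50 signature:
# CRC-32 (4 bytes), compressed size (8 bytes), uncompressed size (8 bytes).
crc, compress_size, file_size = 0x12345678, 1000, 4000
descriptor = struct.pack('<LQQ', crc, compress_size, file_size)
```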
    
  • 2020-11-30 08:34

    Changed in Python 3.5 (from official docs): Added support for writing to unseekable streams.

    This means that zipfile.ZipFile can now write to streams which do not keep the entire file in memory. Such streams do not support seeking back over the data already written.

    So here is a simple generator:

    from zipfile import ZipFile, ZipInfo
    
    def zipfile_generator(path, stream):
        with ZipFile(stream, mode='w') as zf:
            z_info = ZipInfo.from_file(path)
            with open(path, 'rb') as entry, zf.open(z_info, mode='w') as dest:
                for chunk in iter(lambda: entry.read(16384), b''):
                    dest.write(chunk)
                    # Yield chunk of the zip file stream in bytes.
                    yield stream.get()
        # ZipFile was closed.
        yield stream.get()
    

    path is the path of the large file (a string or path-like object).

    stream is an instance of an unseekable stream class like the following (designed according to the official docs):

    from io import RawIOBase
    
    class UnseekableStream(RawIOBase):
        def __init__(self):
            self._buffer = b''
    
        def writable(self):
            return True
    
        def write(self, b):
            if self.closed:
                raise ValueError('Stream was closed!')
            self._buffer += b
            return len(b)
    
        def get(self):
            chunk = self._buffer
            self._buffer = b''
            return chunk
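
    Putting the two pieces together, here is a minimal end-to-end sketch (file and archive names are illustrative) that streams a source file into a zip on disk chunk by chunk:

```python
import os
from io import RawIOBase
from zipfile import ZipFile, ZipInfo

class UnseekableStream(RawIOBase):
    # Write-only buffer; consumed chunks are dropped from memory.
    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        if self.closed:
            raise ValueError('Stream was closed!')
        self._buffer += b
        return len(b)

    def get(self):
        chunk, self._buffer = self._buffer, b''
        return chunk

def zipfile_generator(path, stream):
    with ZipFile(stream, mode='w') as zf:
        z_info = ZipInfo.from_file(path)
        with open(path, 'rb') as entry, zf.open(z_info, mode='w') as dest:
            for chunk in iter(lambda: entry.read(16384), b''):
                dest.write(chunk)
                yield stream.get()
    yield stream.get()  # central directory written on close

# Create a sample source file, then stream it into an archive.
with open('source.bin', 'wb') as f:
    f.write(os.urandom(100000))

with open('streamed.zip', 'wb') as out:
    for chunk in zipfile_generator('source.bin', UnseekableStream()):
        out.write(chunk)
```

    (Note: zf.open(zinfo, mode='w') requires Python 3.6+.)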
    

    You can try this code online: https://repl.it/@IvanErgunov/zipfilegenerator


    There is also another way to create a generator without ZipInfo, by manually reading and dividing your large file: pass a queue.Queue() object to your UnseekableStream() object and write to that queue from another thread; the current thread can then read chunks from the queue iteratively. See the docs.

    P.S. python-zipstream by allanlei is an outdated and unreliable approach. It was an attempt to add support for unseekable streams before this was done officially.
