Compress a file in memory, compute checksum and write it as `gzip` in Python


Question


I want to compress files and compute the checksum of the compressed file using Python. My first naive attempt was to use two functions:

import gzip
import hashlib

def compress_file(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    f_out = gzip.open(output_filename, 'wb')
    f_out.writelines(f_in)
    f_out.close()
    f_in.close()


def md5sum(filename):
    with open(filename) as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    return md5

However, this means the compressed file is written to disk and then re-read. With many files (> 10,000), each several MB when compressed, on an NFS-mounted drive, it is slow.

How can I compress the file in a buffer and then compute the checksum from this buffer before writing the output file?

The files are not that big, so I can afford to store everything in memory. However, a nice incremental version would be welcome too.
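For the incremental part, a minimal sketch could look like this (assuming Python 3; wbits=31 asks zlib to emit a gzip container rather than a raw zlib stream, and the function name and chunk size here are illustrative; in this mode zlib writes a zeroed mtime, so the output should also be reproducible):

import hashlib
import zlib

def compress_md5_incremental(input_filename, output_filename, chunk_size=1 << 20):
    # wbits=31 makes zlib produce a gzip container instead of a raw zlib stream
    compressor = zlib.compressobj(9, zlib.DEFLATED, 31)
    md5 = hashlib.md5()
    with open(input_filename, 'rb') as f_in, open(output_filename, 'wb') as f_out:
        for chunk in iter(lambda: f_in.read(chunk_size), b''):
            c_chunk = compressor.compress(chunk)
            md5.update(c_chunk)   # hash the compressed bytes as they are produced
            f_out.write(c_chunk)
        tail = compressor.flush()  # emit any buffered data plus the gzip trailer
        md5.update(tail)
        f_out.write(tail)
    return md5.hexdigest()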

The last requirement is that it should work with multiprocessing (in order to compress several files in parallel).
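For the parallel part, since each file is independent, a sketch along these lines should fit (illustrative names; compress_one wraps the compress_md5 function developed below):

import multiprocessing

def compress_one(input_filename):
    # Delegate to the compress_md5 function defined below
    return input_filename, compress_md5(input_filename, input_filename + '.gz')

if __name__ == '__main__':
    filenames = ['4327_010.pdf']  # replace with the real file list
    with multiprocessing.Pool() as pool:
        for name, md5 in pool.map(compress_one, filenames):
            print(name, md5)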

I have tried to use zlib.compress, but the returned string is missing the header of a gzip file.

Edit: following @abarnert's suggestion, I used Python 3's gzip.compress:

def compress_md5(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    # Read in buffer
    buff = f_in.read()
    f_in.close()
    # Compress this buffer
    c_buff = gzip.compress(buff)
    # Compute MD5
    md5 = hashlib.md5(c_buff).hexdigest()
    # Write compressed buffer
    f_out = open(output_filename, 'wb')
    f_out.write(c_buff)
    f_out.close()

    return md5

This produces a correct gzip file, but the output is different at each run (the MD5 differs):

>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'0d0eb6a5f3fe2c1f3201bc3360201f71'
>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'8e4954ab5914a1dd0d8d0deb114640e5'

The gzip program doesn't have this problem:

 $ gzip -c 4327_010.pdf | md5sum
 8965184bc4dace5325c41cc75c5837f1  -
 $ gzip -c 4327_010.pdf | md5sum
 8965184bc4dace5325c41cc75c5837f1  -

I guess it's because the gzip module uses the current time by default when creating a file (the gzip program presumably uses the modification time of the input file). There is no way to change that with gzip.compress.
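(Update for newer versions: Python 3.8 and later do add an mtime parameter to gzip.compress, so passing a fixed value makes the output reproducible:)

# Available since Python 3.8; mtime=0 fixes the header timestamp
c_buff = gzip.compress(buff, mtime=0)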

I was thinking of creating a gzip.GzipFile in read/write mode and controlling the mtime, but there is no such mode for gzip.GzipFile.

Inspired by @zwol's suggestion, I wrote the following function, which correctly sets the filename and the OS (Unix) in the header:

import os
import cStringIO  # Python 2; on Python 3, use io.BytesIO instead

def compress_md5(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    # Read data in buffer
    buff = f_in.read()
    f_in.close()
    # Create in-memory output buffer
    c_buff = cStringIO.StringIO()
    # Create gzip file with the input file's mtime so the header is reproducible
    mtime = os.stat(input_filename).st_mtime
    gzip_obj = gzip.GzipFile(input_filename, mode="wb", fileobj=c_buff, mtime=mtime)
    # Compress data in memory
    gzip_obj.write(buff)
    gzip_obj.close()
    # Retrieve compressed data
    c_data = c_buff.getvalue()
    # Byte 9 of the gzip header is the OS field; '\003' means Unix
    c_data = c_data[0:9] + '\003' + c_data[10:]
    # Really write compressed data
    f_out = open(output_filename, "wb")
    f_out.write(c_data)
    f_out.close()
    # Compute MD5 of the compressed data
    md5 = hashlib.md5(c_data).hexdigest()
    return md5

The output is the same across runs. Moreover, the output of file is the same as gzip's:

$ gzip -9 -c 4327_010.pdf > ref_max/4327_010.pdf.gz
$ file ref_max/4327_010.pdf.gz 
ref_max/4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May  5 14:28:16 2015, max compression
$ file 4327_010.pdf.gz 
4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May  5 14:28:16 2015, max compression

However, the MD5 is different:

$ md5sum 4327_010.pdf.gz ref_max/4327_010.pdf.gz 
39dc3e5a52c71a25c53fcbc02e2702d5  4327_010.pdf.gz
213a599a382cd887f3c4f963e1d3dec4  ref_max/4327_010.pdf.gz

gzip -l is also different:

$ gzip -l ref_max/4327_010.pdf.gz 4327_010.pdf.gz 
     compressed        uncompressed  ratio uncompressed_name
        7286404             7600522   4.1% ref_max/4327_010.pdf
        7297310             7600522   4.0% 4327_010.pdf

I guess it's because the gzip program and the Python gzip module (which is based on the C library zlib) use slightly different deflate implementations, so the compressed bytes can differ even at the same compression level.


Answer 1:


Wrap a gzip.GzipFile object around an io.BytesIO object. (In Python 2, use cStringIO.StringIO instead.) After you close the GzipFile, you can retrieve the compressed data from the BytesIO object (using getvalue), hash it, and write it out to a real file.
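A minimal sketch of that approach (the function name is illustrative):

import gzip
import hashlib
import io

def compress_and_hash(input_filename, output_filename):
    with open(input_filename, 'rb') as f_in:
        data = f_in.read()
    buff = io.BytesIO()
    # Compress into the in-memory buffer; closing the GzipFile writes the trailer
    with gzip.GzipFile(filename=input_filename, mode='wb', fileobj=buff) as g:
        g.write(data)
    c_data = buff.getvalue()  # the complete gzip stream, available after close
    digest = hashlib.md5(c_data).hexdigest()
    with open(output_filename, 'wb') as f_out:
        f_out.write(c_data)
    return digest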

Incidentally, you really shouldn't be using MD5 at all anymore.
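For example, with the in-memory approach above, swapping in SHA-256 is a one-line change (c_data being the compressed bytes):

import hashlib

digest = hashlib.sha256(c_data).hexdigest()  # same call pattern as hashlib.md5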




Answer 2:


I have tried to use zlib.compress, but the returned string is missing the header of a gzip file.

Of course. That's the whole difference between the zlib module and the gzip module: zlib deals with zlib-deflate compression without gzip headers; gzip deals with zlib-deflate data with gzip headers.
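A quick way to see the difference is to compare the first bytes of each container (illustrative snippet):

import gzip
import zlib

data = b'example data'
print(zlib.compress(data)[:2])  # b'x\x9c' -- zlib header at the default level
print(gzip.compress(data)[:2])  # b'\x1f\x8b' -- gzip magic number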

So, just call gzip.compress instead, and the code you wrote but didn't show us should just work.


As a side note:

with open(filename) as f:
    md5 = hashlib.md5(f.read()).hexdigest()

You almost certainly want to open the file in 'rb' mode here. You don't want to convert '\r\n' into '\n' (if on Windows), or decode the binary data as sys.getdefaultencoding() text (if on Python 3), so open it in binary mode.
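With that fix, the helper from the question would read:

import hashlib

def md5sum(filename):
    with open(filename, 'rb') as f:  # binary mode: no newline translation, no text decoding
        return hashlib.md5(f.read()).hexdigest()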


Another side note:

Don't use line-based APIs on binary files. Instead of this:

f_out.writelines(f_in)

… do this:

f_out.write(f_in.read())

Or, if the files are too large to read into memory all at once:

from functools import partial

for buf in iter(partial(f_in.read, 8192), b''):
    f_out.write(buf)

And one last point:

With many files (> 10,000), each several MB when compressed, on an NFS-mounted drive, it is slow.

Does your system not have a tmp directory mounted on a faster drive?

In most cases, you don't need a real file. Either there's a string-based API (zlib.compress, gzip.compress, json.dumps, etc.), or the file-based API only requires a file-like object, like a BytesIO.

But when you do need a real temporary file, with a real file descriptor and everything, you almost always want to create it in the temporary directory.* In Python, you do this with the tempfile module.

For example:

import gzip
import hashlib
import tempfile

def compress_and_md5(filename):
    with tempfile.NamedTemporaryFile() as f_out:
        with open(filename, 'rb') as f_in:
            g_out = gzip.GzipFile(fileobj=f_out, mode='wb')
            g_out.write(f_in.read())
            g_out.close()  # flush the gzip trailer before reading back
        f_out.seek(0)
        md5 = hashlib.md5(f_out.read()).hexdigest()
    return md5

If you need an actual filename, rather than a file object, you can use f_out.name (the name of the NamedTemporaryFile).

* The one exception is when you only want the temporary file so you can eventually rename it to a permanent location. In that case, of course, you usually want the temporary file to be in the same directory as the permanent location. But you can do that with tempfile just as easily; just remember to pass delete=False.
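A minimal sketch of that rename pattern (the helper name is illustrative, not from the original answer):

import os
import tempfile

def write_atomically(data, dest_path):
    # Create the temp file in the destination directory so the final
    # rename stays on the same filesystem (and is atomic on POSIX).
    dest_dir = os.path.dirname(dest_path) or '.'
    with tempfile.NamedTemporaryFile(dir=dest_dir, delete=False) as tmp:
        tmp.write(data)
        tmp_name = tmp.name
    os.rename(tmp_name, dest_path)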



Source: https://stackoverflow.com/questions/30113119/compress-a-file-in-memory-compute-checksum-and-write-it-as-gzip-in-python
