Question
I want to compress files and compute the checksum of the compressed file using Python. My first naive attempt was to use two functions:
import gzip
import hashlib

def compress_file(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    f_out = gzip.open(output_filename, 'wb')
    f_out.writelines(f_in)
    f_out.close()
    f_in.close()

def md5sum(filename):
    with open(filename) as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    return md5
However, this leads to the compressed file being written and then re-read. With many files (> 10,000), each several MB when compressed, on an NFS-mounted drive, it is slow.
How can I compress the file into a buffer and then compute the checksum from this buffer before writing the output file?
The files are not that big, so I can afford to store everything in memory. Still, a nice incremental version would be welcome too.
The last requirement is that it should work with multiprocessing (in order to compress several files in parallel).
I have tried to use zlib.compress, but the returned string misses the header of a gzip file.
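(A quick way to see that difference is to compare the first bytes of each output; a minimal illustration:)

import gzip
import zlib

data = b'example payload'
# zlib.compress produces a raw zlib stream (RFC 1950): no gzip magic
# number and no filename/mtime header fields.
print(zlib.compress(data)[:2])  # b'x\x9c' (zlib header, default level)
# gzip.compress wraps the same deflate data in a gzip container
# (RFC 1952), which starts with the magic bytes 1f 8b.
print(gzip.compress(data)[:2])  # b'\x1f\x8b' (gzip magic)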
Edit: following @abarnert's suggestion, I used Python 3's gzip.compress:
def compress_md5(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    # Read input into a buffer
    buff = f_in.read()
    f_in.close()
    # Compress this buffer
    c_buff = gzip.compress(buff)
    # Compute MD5
    md5 = hashlib.md5(c_buff).hexdigest()
    # Write compressed buffer
    f_out = open(output_filename, 'wb')
    f_out.write(c_buff)
    f_out.close()
    return md5
This produces a correct gzip file, but the output is different at each run (the md5 is different):
>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'0d0eb6a5f3fe2c1f3201bc3360201f71'
>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'8e4954ab5914a1dd0d8d0deb114640e5'
The gzip program doesn't have this problem:
$ gzip -c 4327_010.pdf | md5sum
8965184bc4dace5325c41cc75c5837f1 -
$ gzip -c 4327_010.pdf | md5sum
8965184bc4dace5325c41cc75c5837f1 -
I guess it's because the gzip module uses the current time by default when creating a file (the gzip program, I guess, uses the modification time of the input file). There is no way to change that with gzip.compress.
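(On Python 3.8 and later this is no longer true: gzip.compress accepts an mtime keyword argument, so the header timestamp can be pinned. A minimal sketch, assuming such an interpreter:)

import gzip
import hashlib
import os

def compress_md5_deterministic(input_filename, output_filename):
    # gzip.compress grew an mtime parameter in Python 3.8; pinning it
    # to the input file's mtime makes repeated runs byte-identical.
    with open(input_filename, 'rb') as f_in:
        buff = f_in.read()
    c_buff = gzip.compress(buff, mtime=os.stat(input_filename).st_mtime)
    md5 = hashlib.md5(c_buff).hexdigest()
    with open(output_filename, 'wb') as f_out:
        f_out.write(c_buff)
    return md5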
I was thinking of creating a gzip.GzipFile in read/write mode and controlling the mtime, but there is no such mode for gzip.GzipFile.
Inspired by @zwol's suggestion, I wrote the following function, which correctly sets the filename and the OS (Unix) in the header:
def compress_md5(input_filename, output_filename):
    f_in = open(input_filename, 'rb')
    # Read data into a buffer
    buff = f_in.read()
    # Create output buffer
    c_buff = cStringIO.StringIO()
    # Create gzip file with the input file's mtime in the header
    input_file_stat = os.stat(input_filename)
    mtime = input_file_stat[8]
    gzip_obj = gzip.GzipFile(input_filename, mode="wb", fileobj=c_buff, mtime=mtime)
    # Compress data in memory
    gzip_obj.write(buff)
    # Close files
    f_in.close()
    gzip_obj.close()
    # Retrieve compressed data
    c_data = c_buff.getvalue()
    # Change OS value (byte 9 of the header; 3 means Unix)
    c_data = c_data[0:9] + '\003' + c_data[10:]
    # Really write compressed data
    f_out = open(output_filename, "wb")
    f_out.write(c_data)
    f_out.close()
    # Compute MD5
    md5 = hashlib.md5(c_data).hexdigest()
    return md5
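(For reference, a sketch of a Python 3 port of the same approach: io.BytesIO replaces cStringIO.StringIO, and the patched OS byte must be a bytes object rather than a str:)

import gzip
import hashlib
import io
import os

def compress_md5_py3(input_filename, output_filename):
    with open(input_filename, 'rb') as f_in:
        buff = f_in.read()
    c_buff = io.BytesIO()
    mtime = os.stat(input_filename).st_mtime
    with gzip.GzipFile(input_filename, mode='wb', fileobj=c_buff, mtime=mtime) as gzip_obj:
        gzip_obj.write(buff)
    c_data = c_buff.getvalue()
    # Byte 9 of the gzip header is the OS field; 3 means Unix
    c_data = c_data[:9] + b'\x03' + c_data[10:]
    with open(output_filename, 'wb') as f_out:
        f_out.write(c_data)
    return hashlib.md5(c_data).hexdigest()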
The output is the same across runs. Moreover, the output of file is the same as gzip's:
$ gzip -9 -c 4327_010.pdf > ref_max/4327_010.pdf.gz
$ file ref_max/4327_010.pdf.gz
ref_max/4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May 5 14:28:16 2015, max compression
$ file 4327_010.pdf.gz
4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May 5 14:28:16 2015, max compression
However, the md5 is different:
$ md5sum 4327_010.pdf.gz ref_max/4327_010.pdf.gz
39dc3e5a52c71a25c53fcbc02e2702d5 4327_010.pdf.gz
213a599a382cd887f3c4f963e1d3dec4 ref_max/4327_010.pdf.gz
gzip -l is also different:
$ gzip -l ref_max/4327_010.pdf.gz 4327_010.pdf.gz
compressed uncompressed ratio uncompressed_name
7286404 7600522 4.1% ref_max/4327_010.pdf
7297310 7600522 4.0% 4327_010.pdf
I guess it's because the gzip program and the Python gzip module (which is based on the C library zlib) use slightly different algorithms.
Answer 1:
Wrap a gzip.GzipFile object around an io.BytesIO object. (In Python 2, use cStringIO.StringIO instead.) After you close the GzipFile, you can retrieve the compressed data from the BytesIO object (using getvalue), hash it, and write it out to a real file.
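A minimal sketch of that approach (function name illustrative; sha256 used per the note below):

import gzip
import hashlib
import io

def compress_and_hash(input_filename, output_filename):
    buf = io.BytesIO()
    with open(input_filename, 'rb') as f_in:
        # Compress into the in-memory buffer instead of a real file
        with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
            gz.write(f_in.read())
    # Closing the GzipFile first matters: it flushes the deflate stream
    # and appends the CRC32/size trailer to the buffer
    data = buf.getvalue()
    digest = hashlib.sha256(data).hexdigest()
    with open(output_filename, 'wb') as f_out:
        f_out.write(data)
    return digest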
Incidentally, you really shouldn't be using MD5 at all anymore.
Answer 2:
I have tried to use zlib.compress but the returned string misses the header of a gzip file.
Of course. That's the whole difference between the zlib module and the gzip module: zlib just deals with zlib-deflate compression without gzip headers; gzip deals with zlib-deflate data with gzip headers.
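(Side note: zlib can in fact emit and consume gzip framing when asked, via the wbits parameter; a minimal sketch:)

import zlib

data = b'example payload'
# wbits = 16 + MAX_WBITS selects the gzip container instead of the
# zlib one; the resulting stream starts with the gzip magic 1f 8b
# (and zlib writes mtime = 0, so the output is deterministic)
co = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
gz_data = co.compress(data) + co.flush()
print(gz_data[:2])  # b'\x1f\x8b'
print(zlib.decompress(gz_data, 16 + zlib.MAX_WBITS))  # b'example payload'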
So, just call gzip.compress instead, and the code you wrote but didn't show us should just work.
As a side note:
with open(filename) as f:
    md5 = hashlib.md5(f.read()).hexdigest()
You almost certainly want to open the file in 'rb' mode here. You don't want to convert '\r\n' into '\n' (if on Windows), or decode the binary data as sys.getdefaultencoding() text (if on Python 3), so open it in binary mode.
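A fixed version of the helper from the question:

import hashlib

def md5sum(filename):
    # Binary mode: hash the raw bytes, without newline translation
    # or text decoding
    with open(filename, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()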
Another side note:
Don't use line-based APIs on binary files. Instead of this:
f_out.writelines(f_in)
… do this:
f_out.write(f_in.read())
Or, if the files are too large to read into memory all at once:
from functools import partial

for buf in iter(partial(f_in.read, 8192), b''):
    f_out.write(buf)
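Building on that chunked loop, the incremental version the question asked for can hash the compressed bytes as they are written, so neither the input nor the output needs to fit in memory. A sketch; HashingWriter is a hypothetical helper, not part of the standard library:

import gzip
import hashlib
from functools import partial

class HashingWriter:
    # Minimal file-like wrapper: forwards writes to the real file and
    # feeds the same bytes to the hash on the way through
    def __init__(self, fileobj, digest):
        self.fileobj = fileobj
        self.digest = digest
    def write(self, data):
        self.digest.update(data)
        return self.fileobj.write(data)
    def flush(self):
        self.fileobj.flush()

def compress_md5_incremental(input_filename, output_filename, chunk_size=8192):
    md5 = hashlib.md5()
    with open(input_filename, 'rb') as f_in, \
         open(output_filename, 'wb') as f_out:
        with gzip.GzipFile(fileobj=HashingWriter(f_out, md5), mode='wb') as gz:
            for chunk in iter(partial(f_in.read, chunk_size), b''):
                gz.write(chunk)
    return md5.hexdigest()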
And one last point:
With many files (> 10 000), each several MB when compressed, in a NFS mounted drive, it is slow.
Does your system not have a tmp directory mounted on a faster drive?
In most cases, you don't need a real file. Either there's a string-based API (zlib.compress, gzip.compress, json.dumps, etc.), or the file-based API only requires a file-like object, like a BytesIO.
But when you do need a real temporary file, with a real file descriptor and everything, you almost always want to create it in the temporary directory.* In Python, you do this with the tempfile module.
For example:
import gzip
import hashlib
import tempfile

def compress_and_md5(filename):
    with tempfile.NamedTemporaryFile() as f_out:
        with open(filename, 'rb') as f_in:
            # gzip.open defaults to read mode; writing needs 'wb'
            with gzip.open(f_out, 'wb') as g_out:
                g_out.write(f_in.read())
        # The gzip stream is closed (trailer written) before reading back
        f_out.seek(0)
        md5 = hashlib.md5(f_out.read()).hexdigest()
    return md5
If you need an actual filename, rather than a file object, you can use f_out.name.
* The one exception is when you create the temporary file only so you can eventually rename it to a permanent location. In that case, of course, you usually want the temporary file to be in the same directory as the permanent location. But you can do that with tempfile just as easily. Just remember to pass delete=False.
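For that exception, a minimal sketch of the pattern (function name illustrative):

import os
import tempfile

def write_atomically(data, destination):
    # Create the temporary file in the destination's directory so the
    # final rename stays on the same filesystem
    dest_dir = os.path.dirname(destination) or '.'
    with tempfile.NamedTemporaryFile(dir=dest_dir, delete=False) as f:
        f.write(data)
        tmp_name = f.name
    # os.replace is atomic on POSIX and overwrites on Windows too
    os.replace(tmp_name, destination)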
Source: https://stackoverflow.com/questions/30113119/compress-a-file-in-memory-compute-checksum-and-write-it-as-gzip-in-python