Question
I wrote a simple test script that generates a lot of integers and writes them into a compressed file using the gzip module.
import gzip

for idx in range(100000):
    with gzip.open('output.gz', 'ab') as f:
        line = (str(idx) + '\n').encode()
        f.write(line)
The compressed file is created, but when I decompress it, the compressed file turns out to be much larger than the raw data:
$ ls -l
588890 output
3288710 output.gz
Can you please explain what I am doing wrong here?
Answer 1:
The assumption that append mode would append to the existing compressed stream is wrong. Instead, it concatenates a new gzip stream onto the existing file. When decompressing, these streams are concatenated transparently, as if you had compressed a single file. But each stream carries its own header and footer, and that overhead adds up. Inspecting your file reveals:
% hexdump -C output.gz|head -n5
00000000 1f 8b 08 08 2e e7 03 5b 02 ff 6f 75 74 70 75 74 |.......[..output|
00000010 00 33 e0 02 00 12 cd 4a 7e 02 00 00 00 1f 8b 08 |.3.....J~.......|
00000020 08 2e e7 03 5b 02 ff 6f 75 74 70 75 74 00 33 e4 |....[..output.3.|
00000030 02 00 53 fc 51 67 02 00 00 00 1f 8b 08 08 2e e7 |..S.Qg..........|
00000040 03 5b 02 ff 6f 75 74 70 75 74 00 33 e2 02 00 90 |.[..output.3....|
Note the repetition of the magic number 1f 8b, which marks the beginning of a new stream.
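If you want to confirm this without reading the hex dump, here is a minimal sketch (assuming output.gz was produced by your append-mode loop) that counts the gzip magic bytes; each member starts with 1f 8b 08, so the count roughly equals the number of writes:

with open('output.gz', 'rb') as f:
    data = f.read()

# Each gzip member starts with the magic bytes 1f 8b followed by the
# deflate method byte 08. Counting that pattern estimates the number of
# concatenated streams (it could over-count if the bytes happen to occur
# inside compressed data, but that is unlikely here).
members = data.count(b'\x1f\x8b\x08')
print(members, 'gzip members in', len(data), 'bytes')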
It's usually a bad idea to repeatedly open a file in append mode inside a loop. Instead, open the file once for writing and write the contents in the loop:
with gzip.open('output.gz', 'wb') as f:
    for idx in range(100000):
        line = (str(idx) + '\n').encode()
        f.write(line)
The resulting file is around 200 kiB, compared to the roughly 3 MiB produced by the append-mode version.
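If you genuinely need to append across separate runs of a script, a reasonable compromise, sketched below, is to open the file in append mode once per batch rather than once per line; each run then adds only one extra stream, so the header/footer overhead stays negligible (append_batch is a hypothetical helper, not part of the gzip module):

import gzip

def append_batch(path, values):
    # One gzip.open('ab') per batch: the whole batch goes into a single
    # new stream, so the per-stream header/footer cost is paid once per
    # call instead of once per value.
    with gzip.open(path, 'ab') as f:
        for v in values:
            f.write((str(v) + '\n').encode())

append_batch('output.gz', range(100000))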
Source: https://stackoverflow.com/questions/50464559/size-of-files-compressed-with-python-gzip-module-is-not-reduced