Question
I am saving a numpy array which contains repetitive data:
import gzip
import pickle as pkl  # cPickle on Python 2
import numpy as np
import h5py
a = np.random.randn(100000, 10)
# Row i of b is rows i..i+9 of a laid side by side, so neighbouring rows of b share most values.
b = np.hstack([a[cnt:a.shape[0] - 10 + cnt + 1] for cnt in range(10)])
# Gzipped pickle
with gzip.open('noise.pkl.gz', 'wb') as f_pkl_gz:
    pkl.dump(b, f_pkl_gz, protocol=pkl.HIGHEST_PROTOCOL)
# Raw pickle (the file must be opened in binary mode)
with open('noise.pkl', 'wb') as f_pkl:
    pkl.dump(b, f_pkl, protocol=pkl.HIGHEST_PROTOCOL)
# HDF5 with gzip level 9 and no explicit chunk shape
with h5py.File('noise.hdf5', 'w') as f_hdf5:
    f_hdf5.create_dataset('b', data=b, compression='gzip', compression_opts=9)
Listing the resulting files:
-rw-rw-r--. 1 alex alex 76962165 Oct  7 20:51 noise.hdf5
-rw-rw-r--. 1 alex alex 79992937 Oct  7 20:51 noise.pkl
-rw-rw-r--. 1 alex alex  8330136 Oct  7 20:51 noise.pkl.gz
So HDF5 with the highest gzip compression level takes approximately as much space as the raw pickle, and about 9 times as much as the gzipped pickle.
Does anyone have an idea why this happens? And what can I do about it?
Answer 1:
The answer is to set the chunk shape explicitly, as suggested by @tcaswell. HDF5 compresses each chunk separately, and I guess the default chunks are small, so a single chunk does not contain enough redundancy for the compression to benefit from it.
Here's code that writes the same array with several chunk sizes:
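To check the guess about the default chunks, one option is to ask h5py which chunk shape it actually used and compare file sizes. The following is a minimal sketch, not part of the original answer: the helper name `write_and_measure`, the scratch file name, and the smaller 2000-row array are assumptions made for illustration, and `chunks=True` asks h5py to pick a chunk shape itself.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 10))
# Same construction as in the question, on a smaller array.
b = np.hstack([a[cnt:a.shape[0] - 10 + cnt + 1] for cnt in range(10)])


def write_and_measure(data, chunks):
    """Write data to a temporary HDF5 file; return (chunk shape used, file size in bytes)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'scratch.hdf5')  # hypothetical scratch file
        with h5py.File(path, 'w') as f:
            dset = f.create_dataset('b', data=data, compression='gzip',
                                    compression_opts=9, chunks=chunks)
            chunk_shape = dset.chunks  # the chunk shape h5py actually stored
        # Measure only after the file has been closed and fully written.
        return chunk_shape, os.path.getsize(path)


for chunks in (True, (1, 100), (1000, 100)):
    chunk_shape, size = write_and_measure(b, chunks)
    print('requested:', chunks, 'actual:', chunk_shape,
          'file bytes:', size, 'raw bytes:', b.nbytes)
```

If the automatically chosen chunk shape covers only a few columns or rows, the repeated values end up in different chunks and the compressor never sees them together.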
import numpy as np
import h5py
a = np.random.randn(100000, 10)
b = np.hstack([a[cnt:a.shape[0] - 10 + cnt + 1] for cnt in range(10)])
# Write the same array five times, changing only the number of rows per chunk.
for rows in (1, 10, 100, 1000, 10000):
    with h5py.File('noise_chunk_%d.hdf5' % rows, 'w') as f:
        f.create_dataset('b', data=b, compression='gzip', compression_opts=9, chunks=(rows, 100))
And the results:
-rw-rw-r--. 1 alex alex  8341134 Oct  7 21:53 noise_chunk_10000.hdf5
-rw-rw-r--. 1 alex alex  8416441 Oct  7 21:53 noise_chunk_1000.hdf5
-rw-rw-r--. 1 alex alex  9096936 Oct  7 21:53 noise_chunk_100.hdf5
-rw-rw-r--. 1 alex alex 16304949 Oct  7 21:53 noise_chunk_10.hdf5
-rw-rw-r--. 1 alex alex 85770613 Oct  7 21:53 noise_chunk_1.hdf5
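So as the chunks become smaller, the size of the file increases: with 10000-row chunks the HDF5 file (8341134 bytes) is about as small as the gzipped pickle (8330136 bytes), while with single-row chunks (85770613 bytes) it is even larger than the uncompressed pickle.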
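The same trend can be reproduced without HDF5 by deflating blocks of rows independently with zlib, which uses the DEFLATE algorithm that gzip is built on. This is only a rough sketch of per-chunk compression under stated assumptions: the helper `compressed_size` and the smaller 2000-row array are made up for illustration, and HDF5's own overhead (such as the chunk index) is ignored.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 10))
# Same construction as in the question: row i of b is rows i..i+9 of a side by side.
b = np.hstack([a[cnt:a.shape[0] - 10 + cnt + 1] for cnt in range(10)])


def compressed_size(arr, rows_per_chunk):
    """Total bytes after compressing each block of rows on its own."""
    total = 0
    for start in range(0, arr.shape[0], rows_per_chunk):
        block = np.ascontiguousarray(arr[start:start + rows_per_chunk])
        total += len(zlib.compress(block.tobytes(), 9))
    return total


for rows in (1, 10, 100, 1000):
    ratio = compressed_size(b, rows) / b.nbytes
    print('rows per chunk: %5d  compressed/raw: %.3f' % (rows, ratio))
```

A single row holds 100 distinct random values, so there is nothing to deduplicate inside it. The repeats only appear once a block spans several consecutive rows.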
Source: https://stackoverflow.com/questions/33000256/why-do-pickle-gzip-outperform-h5py-on-repetitive-datasets