Why does saving/loading data in python take a lot more space/time than matlab?

Submitted by 旧时模样 on 2019-12-12 04:37:28

Question


I have some variables, which include dictionaries, lists of lists, and numpy arrays. I save all of them with the following code, where obj = [var1, var2, ..., varn]. The variables are small enough to fit in memory.

My problem is that when I save the corresponding variables in MATLAB, the output file takes far less space on disk than it does in Python. Likewise, loading the variables from disk into memory takes much longer in Python than in MATLAB.

import pickle

# Save all variables to disk in a single pickle file
with open(filename, 'wb') as output:
    pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

Thanks


Answer 1:


Try this:

To save to disk

import gzip
import pickle

# Compress the pickled bytes as they are written to disk
with gzip.open(filename + '.gz', 'wb') as gz:
    gz.write(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))

To load from disk

import gzip
import pickle

# Decompress and unpickle in one pass
with gzip.open(filename + '.gz', 'rb') as gz:
    obj = pickle.loads(gz.read())
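
gzip trades some CPU time during save and load for a much smaller file on disk. This is essentially the trade-off MATLAB makes for you automatically, since save compresses .mat files (version 7 and later) by default.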



Answer 2:


MATLAB uses HDF5 with compression to save .mat files (format version 7.3 and later); HDF5 is a format designed for fast access to large amounts of data. Python's pickle saves the information needed to recreate the objects; it is optimized for flexibility, not for speed or size. If you like, use HDF5 from Python as well.




Answer 3:


Well, the issue is with pickle, not with Python per se. As others have mentioned, .mat files saved in version 7.3 or higher use the HDF5 format. HDF5 is optimized to efficiently store and retrieve large datasets; pickle handles data differently. You can replicate or even surpass the performance of MATLAB's save function by using the h5py or netCDF4 Python modules (the netCDF-4 format is itself built on top of HDF5). For example, using HDF5, you may do:

import h5py
import numpy as np

# Write a small numpy array to a dataset named "init"
f = h5py.File('test.hdf5', 'w')
a = np.arange(10)
dset = f.create_dataset("init", data=a)
f.close()

I'm not sure whether doing the equivalent in MATLAB will produce a file of exactly the same size, but it should be close. You can also play around with HDF5's compression features to get the results you want.
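
For instance, here is a minimal sketch of enabling gzip compression on a dataset (the file and dataset names are just placeholders):

import h5py
import numpy as np

data = np.arange(1000000, dtype=np.float64)

# compression="gzip" enables DEFLATE encoding; compression_opts is the level (0-9).
# h5py switches the dataset to chunked storage automatically, which compression requires.
with h5py.File('compressed.hdf5', 'w') as f:
    f.create_dataset("data", data=data, compression="gzip", compression_opts=9)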

Edit 1:

To load an HDF5 file, such as a v7.3 .mat file, you could do something like M2 = h5py.File('file.mat', 'r'). M2 is an HDF5 group, which is kind of like a Python dictionary. Doing M2.keys() gives you the variable names. If one of the variables is an array called "data", you can read it out by doing data = M2["data"][:].
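
Put together, a minimal sketch of that read (assuming the file really contains a variable named "data"):

import h5py

# Open a v7.3 .mat file read-only; h5py treats it as an ordinary HDF5 file
with h5py.File('file.mat', 'r') as M2:
    print(list(M2.keys()))   # names of the variables stored in the file
    data = M2["data"][:]     # slicing with [:] loads the full array into memory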

Edit 2:

To save multiple variables, you can create multiple datasets. The basic syntax is f.create_dataset("variable_name", data=variable). See the h5py documentation for more options. For example:

import h5py
import numpy as np

f = h5py.File('test.hdf5', 'w')

# Each variable is stored under its own name as a separate dataset
data1 = np.ones((4, 4))
data2 = 2 * data1
f.create_dataset("ones", data=data1)
f.create_dataset("twos", data=data2)

f is both a file object and an HDF5 group, so doing f.keys() gives:

[u'ones', u'twos']

To view what's stored under the 'ones' key, you would do:

f['ones'][:]

array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

You can create as many datasets as you would like. When you're done writing files, close the file object: f.close().
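
Alternatively, opening the file in a with block closes it for you; here is the same write as a sketch using a context manager:

import h5py
import numpy as np

# The file is closed automatically when the block exits
with h5py.File('test.hdf5', 'w') as f:
    f.create_dataset("ones", data=np.ones((4, 4)))
    f.create_dataset("twos", data=2 * np.ones((4, 4)))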

I should add that my approach here only works for array-like datasets. You can save other Python objects, such as lists and dictionaries, but doing so requires a bit more work. I only resort to HDF5 for large numpy arrays. For everything else, pickle works just fine for me.
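
For instance, if that extra work is acceptable, a flat dictionary of arrays can be written one dataset per key. A minimal sketch, assuming every value is array-like (the variable names are hypothetical):

import h5py
import numpy as np

variables = {"weights": np.random.rand(3, 3), "bias": np.zeros(3)}

with h5py.File('vars.hdf5', 'w') as f:
    for name, value in variables.items():
        # Each dictionary entry becomes a dataset named after its key
        f.create_dataset(name, data=value)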



Source: https://stackoverflow.com/questions/25712482/why-does-saving-loading-data-in-python-take-a-lot-more-space-time-than-matlab
