How to hash a large object (dataset) in Python?

Submitted by 感情迁移 on 2019-11-28 04:32:40

Thanks to John Montgomery I think I have found a solution, and one with less overhead than converting every number in a possibly huge array to a string:

I can create a byte view of the array and use it to update the hash. This gives the same digest as updating with the array directly, since both expose the same underlying buffer:

>>> import hashlib
>>> import numpy
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> print a.dtype, b.dtype # a and b have a different data type
float64 uint8
>>> hashlib.sha1(a).hexdigest() # hash of the array
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
>>> hashlib.sha1(b).hexdigest() # hash of the byte view
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
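To bundle this into something reusable, here is a minimal sketch (the helper name sha1_of_array is my own, not from the thread) that also folds the dtype and shape into the digest, so that arrays with identical bytes but different layouts hash differently:

import hashlib
import numpy

def sha1_of_array(a):
    h = hashlib.sha1()
    h.update(str(a.dtype).encode())  # dtype participates in the digest
    h.update(str(a.shape).encode())  # so does the shape
    # The byte view is zero-copy but requires a contiguous array,
    # hence ascontiguousarray(), which copies only when it has to.
    h.update(numpy.ascontiguousarray(a).view(numpy.uint8))
    return h.hexdigest()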
John Montgomery

What's the format of the data in the arrays? Couldn't you just iterate through the arrays, convert each value to a string (via some reproducible means), and then feed that into your hash via update()?

e.g.

import hashlib

m = hashlib.md5()  # or sha1, etc.
for value in array:  # 'array' contains the data
    m.update(str(value).encode())  # update() takes bytes, so encode the string

Don't forget, though, that numpy arrays don't provide __hash__() because they are mutable. So be careful not to modify an array after you've calculated its hash (the digest will no longer match).
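A quick illustration of that caveat (a small sketch; the values are arbitrary):

import hashlib
import numpy

a = numpy.zeros(3)
before = hashlib.sha1(a).hexdigest()
a[0] = 1.0  # mutate the array after hashing
after = hashlib.sha1(a).hexdigest()
print(before == after)  # False: the digest reflects the current contents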

John Salvatier

There is a package, joblib, for memoizing functions that take numpy arrays as inputs. Found via this question.
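A minimal sketch of that approach, assuming joblib is installed ("cachedir" is an arbitrary path of my choosing):

from joblib import Memory
import numpy

memory = Memory("cachedir", verbose=0)  # on-disk cache location

@memory.cache
def expensive(a):
    return a.sum()

x = numpy.random.rand(10, 100)
expensive(x)  # computed and written to the cache
expensive(x)  # served from the cache; joblib hashes the array argument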

Here is how I do it in jug (git HEAD at the time of this answer):

import hashlib
import pickle

e = some_array_object
M = hashlib.md5()
M.update(b'np.ndarray')          # tag the type
M.update(pickle.dumps(e.dtype))  # dtype participates in the hash
M.update(pickle.dumps(e.shape))  # so does the shape
try:
    buffer = e.data
    M.update(buffer)
except (AttributeError, BufferError):
    # .data is not usable for non-contiguous arrays; hash a contiguous copy
    M.update(e.copy().data)

The reason is that e.data is only available for some arrays (contiguous arrays). The same goes for a.view(np.uint8), which fails with a non-descriptive type error if the array is not contiguous.
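A short sketch of that failure mode and the workaround:

import numpy

a = numpy.random.rand(10, 100)
t = a.T  # a transposed view: not C-contiguous
print(t.flags['C_CONTIGUOUS'])  # False
# t.view(numpy.uint8) would raise here, since the bytes aren't contiguous
c = numpy.ascontiguousarray(t)  # copies only if needed
b = c.view(numpy.uint8)  # fine: zero-copy byte view of the contiguous copy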

Using NumPy 1.10.1 and Python 2.7.6, you can now simply hash numpy arrays with hashlib if the array is C-contiguous (use numpy.ascontiguousarray() if it isn't), e.g.

>>> h = hashlib.md5()
>>> arr = numpy.arange(101)
>>> h.update(arr)
>>> print(h.hexdigest())
e62b430ff0f714181a18ea1a821b0918

arr.data is always hashable under Python 2, because it's a buffer object, so hash(arr.data) is easy. :) This is suitable unless you care about the difference between differently shaped arrays with the exact same bytes, i.e. unless shape, byte order, and other array 'parameters' must also figure into the hash.

A very fast option is:

hash(a.tobytes())

where a is a numpy ndarray. (Don't use hash(iter(a)): it hashes the iterator object by identity, not the array's contents, so it isn't a content hash at all.)

Obviously not secure hashing, but it should be good for caching etc.
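For instance, a tiny in-process memoization cache built on that key (a sketch; the cache and function names are my own):

import numpy

_cache = {}

def cached_sum(a):
    key = hash(a.tobytes())  # content-based, non-cryptographic key
    if key not in _cache:
        _cache[key] = a.sum()
    return _cache[key]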
