How to hash a large object (dataset) in Python?

Submitted by 感情迁移 on 2019-11-28 04:32:40

Thanks to John Montgomery I think I have found a solution, and one with less overhead than converting every number in a possibly huge array to a string:

I can create a byte view of the array and use it to update the hash. This gives the same digest as updating with the array directly, since both expose the same underlying buffer:

>>> import hashlib
>>> import numpy
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> print a.dtype, b.dtype # a and b have a different data type
float64 uint8
>>> hashlib.sha1(a).hexdigest() # hash of the array
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
>>> hashlib.sha1(b).hexdigest() # hash of the byte view
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
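To bundle this into something reusable, here is a minimal sketch (the helper name sha1_of_array is my own, not from the thread) that also folds the dtype and shape into the digest, so that arrays with identical bytes but different layouts hash differently:

import hashlib
import numpy

def sha1_of_array(a):
    h = hashlib.sha1()
    h.update(str(a.dtype).encode())  # dtype participates in the digest
    h.update(str(a.shape).encode())  # so does the shape
    # The byte view is zero-copy but requires a contiguous array,
    # hence ascontiguousarray(), which copies only when it has to.
    h.update(numpy.ascontiguousarray(a).view(numpy.uint8))
    return h.hexdigest()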
John Montgomery

What's the format of the data in the arrays? Couldn't you just iterate through the arrays, convert each value to a string (via some reproducible means), and then feed that into your hash via update()?

e.g.

import hashlib

m = hashlib.md5()  # or sha1, etc.
for value in array:  # 'array' contains the data
    m.update(str(value).encode())  # update() takes bytes, so encode the string

Don't forget, though, that numpy arrays don't provide __hash__() because they are mutable. So be careful not to modify an array after you've calculated its hash (the digest will no longer match).
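A quick illustration of that caveat (a small sketch; the values are arbitrary):

import hashlib
import numpy

a = numpy.zeros(3)
before = hashlib.sha1(a).hexdigest()
a[0] = 1.0  # mutate the array after hashing
after = hashlib.sha1(a).hexdigest()
print(before == after)  # False: the digest reflects the current contents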

John Salvatier

There is a package, joblib, for memoizing functions that take numpy arrays as inputs. Found via this question.
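A minimal sketch of that approach, assuming joblib is installed ("cachedir" is an arbitrary path of my choosing):

from joblib import Memory
import numpy

memory = Memory("cachedir", verbose=0)  # on-disk cache location

@memory.cache
def expensive(a):
    return a.sum()

x = numpy.random.rand(10, 100)
expensive(x)  # computed and written to the cache
expensive(x)  # served from the cache; joblib hashes the array argument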

Here is how I do it in jug (git HEAD at the time of this answer):

import hashlib
import pickle

e = some_array_object
M = hashlib.md5()
M.update(b'np.ndarray')          # tag the type
M.update(pickle.dumps(e.dtype))  # dtype participates in the hash
M.update(pickle.dumps(e.shape))  # so does the shape
try:
    buffer = e.data
    M.update(buffer)
except (AttributeError, BufferError):
    # .data is not usable for non-contiguous arrays; hash a contiguous copy
    M.update(e.copy().data)

The reason is that e.data is only available for some arrays (contiguous arrays). The same goes for a.view(np.uint8), which fails with a non-descriptive type error if the array is not contiguous.
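A short sketch of that failure mode and the workaround:

import numpy

a = numpy.random.rand(10, 100)
t = a.T  # a transposed view: not C-contiguous
print(t.flags['C_CONTIGUOUS'])  # False
# t.view(numpy.uint8) would raise here, since the bytes aren't contiguous
c = numpy.ascontiguousarray(t)  # copies only if needed
b = c.view(numpy.uint8)  # fine: zero-copy byte view of the contiguous copy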

Using NumPy 1.10.1 and Python 2.7.6, you can now simply hash numpy arrays with hashlib if the array is C-contiguous (use numpy.ascontiguousarray() if it isn't), e.g.

>>> h = hashlib.md5()
>>> arr = numpy.arange(101)
>>> h.update(arr)
>>> print(h.hexdigest())
e62b430ff0f714181a18ea1a821b0918

arr.data is always hashable under Python 2, because it's a buffer object, so hash(arr.data) is easy. :) This is suitable unless you care about the difference between differently shaped arrays with the exact same bytes, i.e. unless shape, byte order, and other array 'parameters' must also figure into the hash.

A very fast option is:

hash(a.tobytes())

where a is a numpy ndarray. (Don't use hash(iter(a)): it hashes the iterator object by identity, not the array's contents, so it isn't a content hash at all.)

Obviously not secure hashing, but it should be good for caching etc.
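For instance, a tiny in-process memoization cache built on that key (a sketch; the cache and function names are my own):

import numpy

_cache = {}

def cached_sum(a):
    key = hash(a.tobytes())  # content-based, non-cryptographic key
    if key not in _cache:
        _cache[key] = a.sum()
    return _cache[key]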
