I want to compute an md5 hash not of a string, but of an entire data structure. I understand the mechanics of a way to do this (dispatch on the type of the value, canonical
bencode sorts dictionaries by key, so the encoding of a dict does not depend on key order:
import hashlib
import bencode
data = ['only', 'lists', [1,2,3],
'dictionaries', {'a':0,'b':1}, 'numbers', 47, 'strings']
data_md5 = hashlib.md5(bencode.bencode(data)).hexdigest()
print(data_md5)
prints:
af1b88ca9fd8a3e828b40ed1b9a2cb20
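To illustrate why sorted dictionary keys make the encoding canonical, here is a minimal sketch of a bencode-style encoder that uses only the standard library. It is not the code of the `bencode` package: the name `bencode_sketch` is hypothetical, it targets Python 3, and it assumes the data contains only ints, strings, bytes, lists/tuples, and dicts with string keys.

```python
import hashlib

def bencode_sketch(value):
    """Encode a nested structure in bencode format (illustrative only)."""
    if isinstance(value, bool):
        # bool is a subclass of int, but bencode has no boolean type.
        raise TypeError("bencode has no boolean type")
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        # Byte strings are length-prefixed: b"spam" -> b"4:spam"
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, (list, tuple)):
        return b"l" + b"".join(bencode_sketch(v) for v in value) + b"e"
    if isinstance(value, dict):
        out = [b"d"]
        # Keys are emitted in sorted order, so insertion order
        # cannot influence the output (assumes all keys are str).
        for key in sorted(value):
            out.append(bencode_sketch(key))
            out.append(bencode_sketch(value[key]))
        out.append(b"e")
        return b"".join(out)
    raise TypeError("unsupported type: %r" % type(value))

# Two dicts with the same items in a different order encode identically.
digest_1 = hashlib.md5(bencode_sketch({"a": 0, "b": 1})).hexdigest()
digest_2 = hashlib.md5(bencode_sketch({"b": 1, "a": 0})).hexdigest()
print(digest_1 == digest_2)  # True
```

Because every value has exactly one encoding, equal structures always yield the same bytes and therefore the same digest.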
I ended up writing it myself as I thought I would have to:
# Python 2 code (uses basestring and long).
from hashlib import md5
import inspect

class Hasher(object):
    """Hashes Python data into md5."""
    def __init__(self):
        self.md5 = md5()
    def update(self, v):
        """Add `v` to the hash, recursively if needed."""
        # Mix in the type so that, e.g., 1 and '1' hash differently.
        self.md5.update(str(type(v)))
        if isinstance(v, basestring):
            self.md5.update(v)
        elif isinstance(v, (int, long, float)):
            self.update(str(v))
        elif isinstance(v, (tuple, list)):
            for e in v:
                self.update(e)
        elif isinstance(v, dict):
            # Visit keys in sorted order so dict ordering does not matter.
            keys = v.keys()
            for k in sorted(keys):
                self.update(k)
                self.update(v[k])
        else:
            # Arbitrary object: hash its non-dunder, non-callable attributes.
            for k in dir(v):
                if k.startswith('__'):
                    continue
                a = getattr(v, k)
                if inspect.isroutine(a):
                    continue
                self.update(k)
                self.update(a)
    def digest(self):
        """Retrieve the digest of the hash."""
        return self.md5.digest()
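The class above targets Python 2 (`basestring`, `long`, implicit byte strings). A possible Python 3 adaptation is sketched below. The name `Hasher3`, the `hexdigest` method, the separate `bytes` branch, and the choice of UTF-8 are assumptions made for this illustration, not part of the original answer.

```python
import hashlib
import inspect

class Hasher3:
    """Hypothetical Python 3 adaptation of the Hasher class above."""

    def __init__(self):
        self.md5 = hashlib.md5()

    def update(self, v):
        """Add `v` to the hash, recursively if needed."""
        # Mix in the type so that 1 and "1" hash differently.
        self.md5.update(str(type(v)).encode("utf-8"))
        if isinstance(v, str):
            self.md5.update(v.encode("utf-8"))
        elif isinstance(v, bytes):
            self.md5.update(v)
        elif isinstance(v, (int, float)):
            self.md5.update(repr(v).encode("utf-8"))
        elif isinstance(v, (tuple, list)):
            for e in v:
                self.update(e)
        elif isinstance(v, dict):
            # Sorted keys make the result independent of insertion order
            # (assumes the keys can be compared with each other).
            for k in sorted(v):
                self.update(k)
                self.update(v[k])
        else:
            # Arbitrary object: hash its non-dunder, non-callable attributes.
            for k in dir(v):
                if k.startswith("__"):
                    continue
                a = getattr(v, k)
                if inspect.isroutine(a):
                    continue
                self.update(k)
                self.update(a)

    def hexdigest(self):
        """Retrieve the digest as a hexadecimal string."""
        return self.md5.hexdigest()

hasher = Hasher3()
hasher.update(["only", "lists", [1, 2, 3], {"a": 0, "b": 1}, 47])
print(hasher.hexdigest())
```

As in the original, objects whose attributes refer back to themselves would recurse without end, so this is only suitable for tree-shaped data.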
ROCKY way: Put all your struct items in one parent entity (if not already), recurse and sort/canonicalize/etc them, then calculate the md5 of its repr.
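One way to read that recipe is sketched below: recursively convert the structure into a form whose `repr` does not depend on dictionary insertion order, then hash the `repr`. The helper names `canonicalize` and `md5_of_repr` are made up for this illustration, and the sketch assumes Python 3 and dictionary keys that can be sorted against each other.

```python
import hashlib

def canonicalize(value):
    """Return an order-independent stand-in for `value` (illustrative only)."""
    if isinstance(value, dict):
        # Replace the dict with a list of (key, value) pairs in sorted key order.
        return ("dict", [(canonicalize(k), canonicalize(value[k]))
                         for k in sorted(value)])
    if isinstance(value, (list, tuple)):
        # Keep the container type name so [1, 2] and (1, 2) stay distinct.
        return (type(value).__name__, [canonicalize(v) for v in value])
    return value

def md5_of_repr(value):
    """Hash the repr of the canonicalized structure."""
    text = repr(canonicalize(value))
    return hashlib.md5(text.encode("utf-8")).hexdigest()

first = {"a": 0, "b": [1, {"y": 2, "x": 1}]}
second = {"b": [1, {"x": 1, "y": 2}], "a": 0}
print(md5_of_repr(first) == md5_of_repr(second))  # True
```

This only gives stable results for values whose `repr` is itself stable; the default `repr` of a user-defined object includes its memory address, which changes between runs.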
UPDATE: the pickle approach below won't work reliably for dictionaries, because two equal dictionaries can be serialized with their keys in a different order. Sorry, I hadn't thought of that.
import hashlib
import cPickle as pickle
data = ['anything', 'you', 'want']
data_pickle = pickle.dumps(data)
data_md5 = hashlib.md5(data_pickle).hexdigest()
This should work for any picklable Python data structure, and for objects as well.
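The caveat about dictionaries can be checked directly. The sketch below assumes Python 3.7 or later, where dictionaries preserve insertion order and `pickle` (the Python 3 counterpart of `cPickle`) writes the items in that order, so two equal dictionaries built in a different order produce different bytes.

```python
import hashlib
import pickle

a = {"a": 0, "b": 1}
b = {"b": 1, "a": 0}  # same items, inserted in the opposite order

md5_a = hashlib.md5(pickle.dumps(a)).hexdigest()
md5_b = hashlib.md5(pickle.dumps(b)).hexdigest()

print(a == b)          # True: the dictionaries compare equal
print(md5_a == md5_b)  # False: the pickled bytes differ
```

So a pickle-based digest identifies a particular serialization rather than the value itself.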
json.dumps() can sort dictionaries by key. So you don't need other dependencies:
import hashlib
import json
data = ['only', 'lists', [1,2,3], 'dictionaries', {'a':0,'b':1}, 'numbers', 47, 'strings']
data_md5 = hashlib.md5(json.dumps(data, sort_keys=True).encode('utf-8')).hexdigest()
print(data_md5)
Prints:
87e83d90fc0d03f2c05631e2cd68ea02
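Wrapped in a small helper, the same idea might look like the sketch below. The function name `json_md5` is hypothetical, and the explicit `separators` and UTF-8 encoding are additions that are not in the answer above: they pin down the whitespace of the serialized text and supply the bytes that `hashlib.md5` requires on Python 3. Because the separators change the serialized text, the digest will not match the one printed above.

```python
import hashlib
import json

def json_md5(data):
    """Return the md5 hex digest of a key-sorted JSON serialization of `data`."""
    serialized = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()

# Key order no longer matters.
print(json_md5({"a": 0, "b": 1}) == json_md5({"b": 1, "a": 0}))  # True

# JSON has no tuple type, so tuples and lists hash the same.
print(json_md5((1, 2, 3)) == json_md5([1, 2, 3]))  # True
```

The trade-off is that only JSON-serializable data can be hashed, and distinctions JSON cannot express (tuple versus list, integer keys versus string keys) are lost.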
While it does require a dependency on joblib, I've found that joblib.hashing.hash(object) works very well; it is designed for use with joblib's disk caching mechanism. Empirically, it seems to produce consistent results from run to run, even on data that pickle mixes up on different runs.
Alternatively, you might be interested in artemis-ml's compute_fixed_hash function, which theoretically hashes objects in a way that is consistent across runs. However, I've not tested it myself.
Sorry for posting millions of years after the original question