Python, checksum of a dict

Question

I'm thinking of creating a checksum of a dict to know whether it has been modified. For the moment I have this:

>>> import hashlib
>>> import pickle
>>> d = {'k': 'v', 'k2': 'v2'}
>>> z = pickle.dumps(d)
>>> hashlib.md5(z).hexdigest()
'8521955ed8c63c554744058c9888dc30'

Perhaps a better solution exists?

Note: I want to create a unique ID for a dict so that I can build a good ETag.

EDIT: I can have abstract data in the dict.

Answer 1:

Something like this:

from functools import reduce  # reduce is not a builtin in Python 3
reduce(lambda x, y: x ^ y, [hash(item) for item in d.items()])

Take the hash of each (key, value) tuple in the dict and XOR them all together.

@katrielalex If the dict contains unhashable items, you could do this:

hash(str(d))

or, maybe even better,

hash(repr(d))

Answer 2:

In Python 3 (3.3 and later), the hash of str and bytes objects is salted with a random value that is different for each Python session, so hash-based checksums are not reproducible between runs. If that is not acceptable for the intended application, use e.g. zlib.adler32 to build the checksum for a dict:

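import zlib

d = {'key1': 'value1', 'key2': 'value2'}
checksum = 0
for item in d.items():
    # Chain Adler-32 over the key and then the value (1 is Adler-32's start value).
    c1 = 1
    for t in item:
        c1 = zlib.adler32(bytes(repr(t), 'utf-8'), c1)
    # XOR is commutative, so the result does not depend on item order.
    checksum = checksum ^ c1

print(checksum)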

Answer 3:

I don't know whether pickle guarantees that the dict is serialized the same way every time.

If you only have dictionaries, I would go for a combination of calls to keys() and sorted(): build a string from the sorted key/value pairs and compute the checksum on that.
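A minimal sketch of that idea, assuming a flat dict whose keys and values have a stable repr() (strings, numbers and the like). The function name dict_checksum is made up for illustration, and sorting on repr(key) is one possible choice that avoids comparing keys of different types:

```python
import hashlib


def dict_checksum(d):
    """Return an MD5 hex digest of a flat dict, independent of item order."""
    # Sort the items by the repr of the key so that insertion order
    # does not influence the resulting string.
    pairs = sorted(d.items(), key=lambda kv: repr(kv[0]))
    # Build one string from the sorted key/value pairs.
    payload = ";".join(f"{k!r}:{v!r}" for k, v in pairs)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


a = {'k': 'v', 'k2': 'v2'}
b = {'k2': 'v2', 'k': 'v'}
print(dict_checksum(a) == dict_checksum(b))  # same content, same checksum
```

Because the digest is computed from text rather than from hash(), it should stay the same across interpreter sessions as long as the reprs do.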

Answer 4:

As you said, you want to generate an ETag based on the dictionary content. An OrderedDict, which preserves the order of its items, may be a better candidate here. Just iterate over the key/value pairs and construct your ETag string.
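One way this could look, as a rough sketch: the helper name etag_from_ordered is hypothetical, and feeding repr() of each key and value into MD5 is an assumption about how the ETag string might be built, not something the answer specifies.

```python
import hashlib
from collections import OrderedDict


def etag_from_ordered(od):
    """Build a hex digest by walking the items in their stored order."""
    h = hashlib.md5()
    for key, value in od.items():
        # Separators keep adjacent keys and values from running together.
        h.update(repr(key).encode("utf-8"))
        h.update(b"=")
        h.update(repr(value).encode("utf-8"))
        h.update(b";")
    return h.hexdigest()


od = OrderedDict([('k', 'v'), ('k2', 'v2')])
print(etag_from_ordered(od))
```

Note that the result depends on insertion order: two OrderedDicts holding the same items in a different order produce different digests, so this only suits cases where the order is itself controlled.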

Answer 5:

I think you may not realise some of the subtleties that go into this. The first problem is that the order in which items appear in a dict is not something you can rely on: two equal dicts can list their items in different orders. This means that simply asking for str of a dict doesn't work, because you could have

str(d1) == "{'a':1, 'b':2}"
str(d2) == "{'b':2, 'a':1}"

and these will hash to different values. If you have only hashable items in the dict, you can hash them and then combine their hashes, as @Bart does, or simply

hash(tuple(sorted(hash(x) for x in d.items())))

Note the sorted, because you have to ensure that the hashed tuple comes out in the same order irrespective of which order the items appear in the dict. If you have dicts in the dict, you could recurse this, but it will be complicated.
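A sketch of what such a recursion might look like, assuming the leaves are plain values with a stable repr() (strings, numbers, None). The names canonical and nested_checksum are invented for this example, and it swaps hash() for hashlib so the result is reproducible between sessions:

```python
import hashlib


def canonical(obj):
    """Recursively convert containers into an order-independent form."""
    if isinstance(obj, dict):
        items = [(canonical(k), canonical(v)) for k, v in obj.items()]
        # Sort on repr so keys of different types never get compared directly.
        return ("dict", tuple(sorted(items, key=repr)))
    if isinstance(obj, (list, tuple)):
        # Sequence order is meaningful, so it is kept as-is.
        return ("seq", tuple(canonical(x) for x in obj))
    if isinstance(obj, (set, frozenset)):
        return ("set", tuple(sorted((canonical(x) for x in obj), key=repr)))
    return obj  # assumed to be a leaf with a stable repr


def nested_checksum(d):
    """Return a SHA-256 hex digest of the canonical form of d."""
    return hashlib.sha256(repr(canonical(d)).encode("utf-8")).hexdigest()


a = {'a': 1, 'b': {'x': [1, 2], 'y': {3, 4}}}
b = {'b': {'y': {4, 3}, 'x': [1, 2]}, 'a': 1}
print(nested_checksum(a) == nested_checksum(b))
```

Objects whose repr includes a memory address, or whose repr does not reflect their contents, would defeat this sketch.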

BUT it would be easy to break any implementation like this if you allow arbitrary data in the dictionary, since you can simply write an object with a broken __hash__ implementation and use that. And you can't use id instead, because two items that compare equal can have different ids.

The moral of the story is that hashing dicts isn't supported in Python for a reason.

Source: https://stackoverflow.com/questions/6923780/python-checksum-of-a-dict

Tags

python

checksum