Computing an md5 hash of a data structure


I want to compute an md5 hash not of a string, but of an entire data structure. I understand the mechanics of a way to do this (dispatch on the type of the value, canonical

7 Answers

眼角桃花

2020-12-04 10:34

bencode sorts dictionaries so:

import hashlib
import bencode

data = ['only', 'lists', [1, 2, 3],
        'dictionaries', {'a': 0, 'b': 1}, 'numbers', 47, 'strings']
# bencode emits dictionary keys in sorted order, so the bytes are stable.
data_md5 = hashlib.md5(bencode.bencode(data)).hexdigest()
print(data_md5)

prints:

af1b88ca9fd8a3e828b40ed1b9a2cb20
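To make the "bencode sorts dictionaries" point concrete, here is a minimal sketch of the encoding rules in plain Python. `bencode_sketch` is a hypothetical helper written for illustration, not the `bencode` package's own code, and it is not claimed to reproduce the digest above. It only assumes the standard bencode grammar: `i<n>e` for integers, `<len>:<bytes>` for strings, `l...e` for lists, and `d...e` for dictionaries with keys in sorted order.

```python
import hashlib


def bencode_sketch(value):
    """Encode ints, str/bytes, lists and dicts using bencode's rules.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, bool):
        raise TypeError("bencode has no boolean type")
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        # Strings are length-prefixed, so no escaping is needed.
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, (list, tuple)):
        return b"l" + b"".join(bencode_sketch(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = []
        for key, val in value.items():
            if isinstance(key, str):
                key = key.encode("utf-8")
            if not isinstance(key, bytes):
                raise TypeError("bencode dictionary keys must be strings")
            items.append((key, val))
        # Sorted keys make the output independent of insertion order.
        items.sort(key=lambda pair: pair[0])
        body = b"".join(bencode_sketch(k) + bencode_sketch(v) for k, v in items)
        return b"d" + body + b"e"
    raise TypeError("cannot bencode %r" % type(value))


# Two equal dicts built in a different order encode to the same bytes.
encoded = bencode_sketch({'b': 1, 'a': 0})
digest_1 = hashlib.md5(encoded).hexdigest()
digest_2 = hashlib.md5(bencode_sketch({'a': 0, 'b': 1})).hexdigest()
```

Because every value is either length-prefixed or explicitly terminated, the encoding is unambiguous, which is what makes it a reasonable input for a hash.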


粉色の甜心

2020-12-04 10:43

I ended up writing it myself as I thought I would have to:

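import inspect
from hashlib import md5


class Hasher(object):
    """Hashes Python data into md5."""
    def __init__(self):
        self.md5 = md5()

    def update(self, v):
        """Add `v` to the hash, recursively if needed."""
        # Mix in the type first so that e.g. 1 and '1' hash differently.
        self.md5.update(str(type(v)).encode('utf-8'))
        if isinstance(v, bytes):
            self.md5.update(v)
        elif isinstance(v, str):
            # md5 only accepts bytes on Python 3 (this was `basestring` on Python 2).
            self.md5.update(v.encode('utf-8'))
        elif isinstance(v, (int, float)):
            self.update(str(v))
        elif isinstance(v, (tuple, list)):
            for e in v:
                self.update(e)
        elif isinstance(v, dict):
            # Sort the keys so the digest does not depend on dict ordering.
            for k in sorted(v):
                self.update(k)
                self.update(v[k])
        else:
            # Any other object: hash its non-dunder, non-routine attributes.
            for k in dir(v):
                if k.startswith('__'):
                    continue
                a = getattr(v, k)
                if inspect.isroutine(a):
                    continue
                self.update(k)
                self.update(a)

    def digest(self):
        """Retrieve the digest of the hash."""
        return self.md5.digest()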


爱一瞬间的悲伤

2020-12-04 10:47

ROCKY way: Put all your struct items in one parent entity (if not already), recurse and sort/canonicalize/etc them, then calculate the md5 of its repr.
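One way that recipe could look in Python is sketched below. The helper names `canonicalize` and `md5_of_repr` are made up for this example, and it assumes every leaf value (strings, numbers, `None`, and so on) has a `repr` that is stable from run to run; objects whose default `repr` includes a memory address would not qualify.

```python
import hashlib


def canonicalize(value):
    """Return a nested tuple structure whose repr() is deterministic.

    Hypothetical helper: containers are tagged with their kind, and
    unordered containers (dict, set) are sorted by the repr of their items.
    """
    if isinstance(value, dict):
        pairs = [(canonicalize(k), canonicalize(v)) for k, v in value.items()]
        return ("dict", tuple(sorted(pairs, key=repr)))
    if isinstance(value, (set, frozenset)):
        return ("set", tuple(sorted((canonicalize(e) for e in value), key=repr)))
    if isinstance(value, list):
        return ("list", tuple(canonicalize(e) for e in value))
    if isinstance(value, tuple):
        return ("tuple", tuple(canonicalize(e) for e in value))
    # Assumption: anything else is a scalar with a stable repr.
    return value


def md5_of_repr(value):
    """MD5 hex digest of the repr of the canonicalized structure."""
    return hashlib.md5(repr(canonicalize(value)).encode("utf-8")).hexdigest()


digest = md5_of_repr({'a': 0, 'b': [1, 2, {'x', 'y'}]})
```

Sorting by `repr` rather than by the values themselves sidesteps the `TypeError` that Python 3 raises when comparing keys of different types.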

陌清茗

2020-12-04 10:50
UPDATE: this won't work reliably for dictionaries, because the pickle depends on key order. Sorry, I hadn't thought of that.
```
import hashlib
import pickle  # on Python 2 this was `import cPickle as pickle`

data = ['anything', 'you', 'want']
data_pickle = pickle.dumps(data)
data_md5 = hashlib.md5(data_pickle).hexdigest()
```
This should work for any Python data structure, and for objects as well.
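
The dictionary caveat is easy to reproduce. The snippet below is only an illustration and assumes CPython 3.7 or later, where dictionaries keep insertion order and `pickle` writes items in that order; the variable names are arbitrary.

```python
import hashlib
import pickle

first = {'a': 0, 'b': 1}
second = {'b': 1, 'a': 0}  # same content, different insertion order

# The two dictionaries compare equal...
dicts_equal = first == second

# ...but their pickles list the items in a different order,
# so the digests are not expected to match.
md5_first = hashlib.md5(pickle.dumps(first)).hexdigest()
md5_second = hashlib.md5(pickle.dumps(second)).hexdigest()
digests_equal = md5_first == md5_second

print(dicts_equal, digests_equal)
```

So the pickle-based digest identifies one particular serialization of the data, not the data itself.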

抹茶落季

2020-12-04 10:58

json.dumps() can sort dictionaries by key. So you don't need other dependencies:

import hashlib
import json

data = ['only', 'lists', [1, 2, 3], 'dictionaries', {'a': 0, 'b': 1}, 'numbers', 47, 'strings']
# md5 needs bytes on Python 3, so encode the JSON text first.
data_md5 = hashlib.md5(json.dumps(data, sort_keys=True).encode('utf-8')).hexdigest()

print(data_md5)

Prints:

87e83d90fc0d03f2c05631e2cd68ea02
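
A short sketch of what this approach does and does not distinguish, using a hypothetical wrapper `json_md5` around the same two calls. The caveats follow from JSON itself: it has no tuple type, and object keys are always strings.

```python
import hashlib
import json


def json_md5(data):
    """MD5 of the key-sorted JSON text of `data` (hypothetical helper)."""
    text = json.dumps(data, sort_keys=True)
    return hashlib.md5(text.encode('utf-8')).hexdigest()


# Key order does not matter once sort_keys=True is set.
same_dict = json_md5({'a': 0, 'b': 1}) == json_md5({'b': 1, 'a': 0})

# Tuples are written as JSON arrays, so a tuple and a list collide.
tuple_vs_list = json_md5((1, 2)) == json_md5([1, 2])

# Integer keys are written as strings, so these collide as well.
int_vs_str_key = json_md5({1: 'x'}) == json_md5({'1': 'x'})

print(same_dict, tuple_vs_list, int_vs_str_key)
```

If those collisions matter, or the data contains types `json` cannot serialize, one of the type-aware approaches in the other answers is a better fit.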


我在风中等你

2020-12-04 10:59

While it does require a dependency on joblib, I've found that joblib.hashing.hash(object) works very well; it is designed for use with joblib's disk caching mechanism. Empirically, it seems to produce consistent results from run to run, even on data that pickle serializes differently on different runs.

Alternatively, you might be interested in artemis-ml's compute_fixed_hash function, which in theory hashes objects in a way that is consistent across runs. However, I've not tested it myself.

Sorry for posting millions of years after the original question.
