Computing an md5 hash of a data structure

Backend · open · 7 answers · 1755 views

南旧 asked 2020-12-04 10:13

I want to compute an md5 hash not of a string, but of an entire data structure. I understand the mechanics of a way to do this (dispatch on the type of the value, canonical…
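That dispatch-and-canonicalize approach can be sketched as follows (a minimal illustration; `canonical_md5` is a hypothetical name, not from any answer below):

```python
import hashlib

def canonical_md5(v, _h=None):
    # Hypothetical sketch: dispatch on the type of the value and
    # canonicalize dictionary key order by sorting the keys.
    top = _h is None
    h = hashlib.md5() if top else _h
    h.update(str(type(v)).encode())          # mix in the type as a tag
    if isinstance(v, dict):
        for k in sorted(v):                  # canonical key order
            canonical_md5(k, h)
            canonical_md5(v[k], h)
    elif isinstance(v, (list, tuple)):
        for e in v:
            canonical_md5(e, h)
    else:
        h.update(repr(v).encode())           # leaves: hash their repr
    return h.hexdigest() if top else None
```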

7 Answers
  • 2020-12-04 10:34

    bencode sorts dictionaries so:

    import hashlib
    import bencode  # third-party: pip install bencode.py

    data = ['only', 'lists', [1, 2, 3],
            'dictionaries', {'a': 0, 'b': 1}, 'numbers', 47, 'strings']
    data_md5 = hashlib.md5(bencode.bencode(data)).hexdigest()
    print(data_md5)
    

    prints:

    af1b88ca9fd8a3e828b40ed1b9a2cb20
    
  • 2020-12-04 10:43

    I ended up writing it myself as I thought I would have to:

    import inspect
    from hashlib import md5

    class Hasher(object):
        """Hashes Python data into md5."""
        def __init__(self):
            self.md5 = md5()

        def update(self, v):
            """Add `v` to the hash, recursively if needed."""
            self.md5.update(str(type(v)).encode())
            if isinstance(v, str):
                self.md5.update(v.encode())
            elif isinstance(v, bytes):
                self.md5.update(v)
            elif isinstance(v, (int, float)):
                self.update(str(v))
            elif isinstance(v, (tuple, list)):
                for e in v:
                    self.update(e)
            elif isinstance(v, dict):
                # Sort keys so equal dicts hash equally regardless of order.
                for k in sorted(v):
                    self.update(k)
                    self.update(v[k])
            else:
                # Arbitrary objects: hash their non-dunder data attributes.
                for k in dir(v):
                    if k.startswith('__'):
                        continue
                    a = getattr(v, k)
                    if inspect.isroutine(a):
                        continue
                    self.update(k)
                    self.update(a)

        def digest(self):
            """Retrieve the digest of the hash."""
            return self.md5.digest()
    
  • 2020-12-04 10:47

    A rough-and-ready way: put all your structure's items in one parent entity (if they aren't already), recurse and sort/canonicalize them, then calculate the md5 of its repr.
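One way to read that recipe (a minimal sketch; `canonicalize` and `md5_of_repr` are hypothetical names):

```python
import hashlib

def canonicalize(v):
    # Recursively rewrite containers so that repr() becomes
    # independent of dictionary insertion order.
    if isinstance(v, dict):
        return ('dict', tuple(sorted((k, canonicalize(x)) for k, x in v.items())))
    if isinstance(v, (list, tuple)):
        return ('seq', tuple(canonicalize(x) for x in v))
    return v

def md5_of_repr(v):
    return hashlib.md5(repr(canonicalize(v)).encode()).hexdigest()
```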

  • 2020-12-04 10:50

    UPDATE: this won't work reliably for dictionaries, because the pickled bytes depend on key order. Sorry, I hadn't thought of that.

    import hashlib
    import pickle

    data = ['anything', 'you', 'want']
    data_pickle = pickle.dumps(data)
    data_md5 = hashlib.md5(data_pickle).hexdigest()
    

    This should work for any picklable Python data structure, objects included.
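The dictionary caveat is easy to demonstrate: two equal dicts built in different insertion orders pickle to different byte streams, so their hashes differ (a minimal sketch):

```python
import hashlib
import pickle

d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}
assert d1 == d2                        # equal values...
h1 = hashlib.md5(pickle.dumps(d1)).hexdigest()
h2 = hashlib.md5(pickle.dumps(d2)).hexdigest()
print(h1 == h2)                        # ...but the pickles, and hashes, differ
```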

  • 2020-12-04 10:58

    json.dumps() can sort dictionaries by key, so you don't need any extra dependencies:

    import hashlib
    import json

    data = ['only', 'lists', [1, 2, 3], 'dictionaries', {'a': 0, 'b': 1}, 'numbers', 47, 'strings']
    # json.dumps returns str; md5 needs bytes, hence the encode()
    data_md5 = hashlib.md5(json.dumps(data, sort_keys=True).encode('utf-8')).hexdigest()

    print(data_md5)
    

    Prints:

    87e83d90fc0d03f2c05631e2cd68ea02
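The same idea wrapped as a reusable helper, assuming everything in the structure is JSON-serializable (`md5_json` is a hypothetical name; fixed separators keep the digest independent of whitespace defaults):

```python
import hashlib
import json

def md5_json(obj):
    # Canonical JSON: sorted keys, fixed separators, then utf-8 bytes.
    payload = json.dumps(obj, sort_keys=True, separators=(',', ':'))
    return hashlib.md5(payload.encode('utf-8')).hexdigest()
```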
    
  • 2020-12-04 10:59

    While it does add a dependency on joblib, I've found that joblib.hashing.hash(object) works very well and is designed for use with joblib's disk-caching mechanism. Empirically it seems to produce consistent results from run to run, even on data where plain pickling varies between runs.

    Alternatively, you might be interested in artemis-ml's compute_fixed_hash function, which theoretically hashes objects in a way that is consistent across runs. However, I've not tested it myself.

    Sorry for posting millions of years after the original question.
