Python frozenset hashing algorithm / implementation

愿得一人 2020-12-08 20:33

I'm currently trying to understand the mechanism behind the hash function defined for Python's built-in frozenset data type. The implementation is shown at t…
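
One constraint is clear from the start: the hash can't depend on iteration or insertion order, because equal frozensets must hash equal:

    >>> hash(frozenset([1, 2, 3])) == hash(frozenset([3, 2, 1]))
    True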

3 Answers
  •  遥遥无期
    2020-12-08 21:34

    Unless Raymond Hettinger (the code's author) chimes in, we'll never know for sure ;-) But there's usually less "science" in these things than you might expect: you take some general principles, and a test suite, and fiddle the constants almost arbitrarily until the results look "good enough".

    Some general principles "obviously" at work here:

    1. To get the desired quick "bit dispersion", you want to multiply by a large integer. Since CPython's hash result has to fit in 32 bits on many platforms, an integer that requires 32 bits is best for this. And, indeed, (3644798167).bit_length() == 32.

    2. To avoid systematically losing the low-order bit(s), you want to multiply by an odd integer. 3644798167 is odd.

    3. More generally, to avoid compounding patterns in the input hashes, you want to multiply by a prime. And 3644798167 is prime.

    4. And you also want a multiplier whose binary representation doesn't have obvious repeating patterns. bin(3644798167) == '0b11011001001111110011010011010111'. That's pretty messed up, which is a good thing ;-)
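
    A quick sanity check of all four properties (the primality test here is just naive trial division, which is plenty for a single number):

    M = 3644798167
    assert M.bit_length() == 32      # 1: needs all 32 bits
    assert M % 2 == 1                # 2: odd
    assert all(M % f for f in range(2, int(M ** 0.5) + 1))   # 3: prime
    assert bin(M) == '0b11011001001111110011010011010111'    # 4: messy bits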

    The other constants look utterly arbitrary to me. The

    if h == -1:
        h = 590923713
    

    part is needed for another reason: internally, CPython takes a -1 return value from an integer-valued C function as meaning "an exception needs to be raised"; i.e., it's an error return. So you'll never see a hash code of -1 for any object in CPython. The value returned instead of -1 is wholly arbitrary - it just needs to be the same value (instead of -1) each time.
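
    You can see the dodge with plain integers - small ints hash to themselves, except that -1 can't:

    >>> hash(-2)
    -2
    >>> hash(-1)    # can't be -1, so CPython remaps it
    -2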

    EDIT: playing around

    I don't know what Raymond used to test this. Here's what I would have used: look at hash statistics for all subsets of a set of consecutive integers. Those are problematic because hash(i) == i for a great many integers i.

    >>> all(hash(i) == i for i in range(1000000))
    True
    

    Simply xor'ing hashes together will yield massive cancellation on inputs like that.
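
    For a concrete collision: because hash(i) == i here, the subsets {1, 2} and {3} xor to exactly the same value:

    >>> hash(1) ^ hash(2)
    3
    >>> hash(1) ^ hash(2) == hash(3)
    True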

    So here's a little function to generate all subsets, and another to do a dirt-simple xor across all hash codes:

    def hashxor(xs):
        # dirt-simple: just xor all the element hashes together
        h = 0
        for x in xs:
            h ^= hash(x)
        return h
    
    def genpowerset(xs):
        # yield every subset of xs as a tuple, shortest first
        from itertools import combinations
        for length in range(len(xs) + 1):
            for t in combinations(xs, length):
                yield t
    

    Then a driver, and a little function to display collision statistics:

    def show_stats(d):
        # d maps hash value -> number of inputs that produced it
        total = sum(d.values())
        print("total", total, "unique hashes", len(d),
              "collisions", total - len(d))
    
    def drive(n, hasher=hashxor):
        # hash every subset of range(n) and tally the collisions
        from collections import defaultdict
        d = defaultdict(int)
    
        for t in genpowerset(range(n)):
            d[hasher(t)] += 1
        show_stats(d)
    

    Using the dirt-simple hasher is disastrous:

    >>> drive(20)
    total 1048576 unique hashes 32 collisions 1048544
    

    Yikes! (No mystery about the 32, though: xoring integers smaller than 20 can only produce values in range(32), so at most 32 distinct hashes are possible.) OTOH, using the _hash() designed for frozensets (transcribed below) does a perfect job in this case:

    >>> drive(20, _hash)
    total 1048576 unique hashes 1048576 collisions 0
    
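    The _hash used above isn't shown in the (truncated) question, so for reference, here's a pure-Python version of the same algorithm; it's essentially collections.abc.Set._hash from the standard library, with the mask standing in for C unsigned arithmetic at the native word size:

    import sys
    
    def _hash(xs):
        MAX = sys.maxsize
        MASK = 2 * MAX + 1           # mask to the native word size
        n = len(xs)
        h = 1927868237 * (n + 1)     # initialization folds in the size
        h &= MASK
        for x in xs:
            hx = hash(x)
            # scramble each element's hash, then xor (order-independent)
            h ^= (hx ^ (hx << 16) ^ 89869747) * 3644798167
            h &= MASK
        h = h * 69069 + 907133923    # final scramble
        h &= MASK
        if h > MAX:
            h -= MASK + 1            # fold back into the signed range
        if h == -1:                  # -1 is reserved as the C error code
            h = 590923713
        return h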

    Then you can play with that to see what does - and doesn't - make a real difference in _hash(). For example, it still does a perfect job on these inputs if

        h = h * 69069 + 907133923
    

    is removed. And I have no idea why that line is there. Similarly, it continues to do a perfect job on these inputs if the ^ 89869747 in the inner loop is removed - don't know why that's there either. And initialization can be changed from:

        h = 1927868237 * (n + 1)
    

    to:

        h = n
    

    without harm here too. That all jibes with what I expected: it's the multiplicative constant in the inner loop that's crucial, for reasons already explained. For example, add 1 to it (use 3644798168) and then it's no longer prime or odd, and the stats degrade to:

    total 1048576 unique hashes 851968 collisions 196608
    

    Still quite usable, but definitely worse. Change it to a small prime, like 13, and it's worse:

    total 1048576 unique hashes 483968 collisions 564608
    

    Use a multiplier with an obvious binary pattern, like 0b01010101010101010101010101010101, and worse again:

    total 1048576 unique hashes 163104 collisions 885472
    

    Play around! These things are fun :-)
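
    One convenient way to script those experiments (this factory and its parameter names are mine, not anything in CPython) is to build _hash variants with the constants swapped out:

    def make_hash(mult=3644798167, inner_xor=89869747,
                  init_mult=1927868237, final_mult=69069,
                  final_add=907133923):
        # return a _hash variant using the given constants
        import sys
        MAX = sys.maxsize
        MASK = 2 * MAX + 1
    
        def hasher(xs):
            h = (init_mult * (len(xs) + 1)) & MASK
            for x in xs:
                hx = hash(x)
                h ^= (hx ^ (hx << 16) ^ inner_xor) * mult
                h &= MASK
            h = (h * final_mult + final_add) & MASK
            if h > MAX:
                h -= MASK + 1
            return 590923713 if h == -1 else h
    
        return hasher
    

    Then drive(20, make_hash(mult=3644798168)) reruns the "add 1 to the multiplier" experiment, and make_hash(final_mult=1, final_add=0) effectively deletes the h = h * 69069 + 907133923 step.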
