hashing different tuples in python give identical result

冷暖自知 提交于 2020-05-10 09:54:16

问题


I'm working with sets of integer matrices, and I thought representing them as tuples made sense, as they are hashable. However the hash() function gave me strange results for tuples:

hash(((1, -1, 0), (1, 0, 0), (1, 0, -1)))

Out[147]: -697649482279922733

hash(((1, 0, -1), (1, 0, 0), (1, -1, 0)))

Out[148]: -697649482279922733

As you can see, these two different tuples have the same hash value. Note that they are actually pretty similar (exchange of the first and last subtuples), however I couldn't find a more minimal example: ((0,1),(0,0)) and ((0,0),(0,1)) have different hash values for example.

Any clue of what's going on? I can't believe that it's just incredibly bad luck... Now that I have found where the problem is I could bypass it easily, but I thought it was worth mentioning here anyway.


回答1:


Until Python 3.8, the hash of a tuple is based on the hashes of the content using the following formula (from the tuplehash() function):

Py_uhash_t mult = _PyHASH_MULTIPLIER; /* defined as 1000003UL == 0xf4243 */
x = 0x345678UL;
p = v->ob_item;
while (--len >= 0) {
    y = PyObject_Hash(*p++);
    if (y == -1)
        return -1;
    x = (x ^ y) * mult;
    /* the cast might truncate len; that doesn't change hash stability */
    mult += (Py_hash_t)(82520UL + len + len);
}
x += 97531UL;
if (x == (Py_uhash_t)-1)
    x = -2;
return x;

This is a method known as the FNV-1 (Fowler / Noll / Vo) hash method.

As it happens, that formula produces the exact same output for (1, 0, -1) and (1, -1, 0):

>>> hash((1, -1, 0))
-2528505496374624146
>>> hash((1, 0, -1))
-2528505496374624146

because the hashes for the 3 contained integers are 1, 0 and -2:

>>> hash(1)
1
>>> hash(0)
0
>>> hash(-1)
-2

and swapping the 0 and the -2 has no actual influence on the outcome.

So the hashes for the 3 contained tuples don't change between the two examples, so the final hash doesn't change either.

This is just a coincidence, and I'd expect that in practice this doesn't happen all that often and dictionaries and sets already can handle collisions just fine.

However, a few years after writing my original answer, it turns out that my expectation was misplaced! The above tuplehash() implementation was in place for 14 years, until someone insisted that there was a problem with the scheme. It turns out that certain value combinations (such as 4 and -4, or 0.25 and 0.5) drastically reduced the possible hash values the method could output:

>>> import sys; from itertools import product
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=7, releaselevel='final', serial=0)
>>> values = (0.25, 0.5)
>>> sum(1 for _ in product(values, repeat=20))  # 20 elements in each tuple
1048576
>>> len(set(map(hash, product(values, repeat=20))))
32

The above creates all 1048576 (2 ** 20 == 1024 ** 2) possible 20-value tuples that combine 0.25 and 0.5. Ideally, they all should have a different hash value, or at least have a very large number of different hash values. But the above tuplehash() function only produced 32 unique values. Each of those 32 unique hashes applies to 32768 (2 ** 15) such combinations:

>>> from collections import Counter
>>> Counter(Counter(map(hash, product(values, repeat=20))).values())
Counter({32768: 32})

This is actually quite a big problem! The above issue also comes into play for 1, -1, 0, it just isn't as pronounced; testing here with 3 ** 12 == 531441 combinations:

>>> values = (1, -1, 0)
>>> sum(1 for _ in product(values, repeat=12))
531441
>>> len(set(map(hash, product(values, repeat=12))))
238605
>>> Counter(Counter(map(hash, product(values, repeat=12))).values())
Counter({1: 153005, 2: 51006, 4: 21730, 8: 8424, 16: 3012, 32: 994, 64: 314, 128: 92, 256: 20, 512: 6, 1024: 2})

so 153005 of the hashes produced for those 12-element tuples are all using a single hash.

So in Python 3.8, the implementation was switched from FNV-1a to an adaptation of the xxHash fast digest scheme. See the new tuplehash() function implementation for details.

This new method performs great on the examples from your question:

>>> sys.version_info
sys.version_info(major=3, minor=8, micro=1, releaselevel='final', serial=0)
>>> hash((1, -1, 0))
426056430309831993
>>> hash((1, 0, -1))
-7823806182320511195
>>> hash(((1, -1, 0), (1, 0, 0), (1, 0, -1)))
-6252168277346219339
>>> hash(((1, 0, -1), (1, 0, 0), (1, -1, 0)))
-5221381175350594014

as well as the pathological cases I discussed above:

>>> values = (0.25, 0.5)
>>> len(set(map(hash, product(values, repeat=20))))
1048576
>>> values = (1, -1, 0)
>>> len(set(map(hash, product(values, repeat=12))))
531441



回答2:


Seems strange, but don't use hash either way: https://docs.python.org/2/library/functions.html#hash

[hash is] used to quickly compare dictionary keys during a dictionary lookup.

It's not actually made for general purpose hashing - dictionaries have extra checks beyond simple hash equality. For general purpose hashing, use hashlib



来源:https://stackoverflow.com/questions/32504977/hashing-different-tuples-in-python-give-identical-result

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!