Counting collisions in a Python dictionary

眼角桃花 2020-12-28 19:09

My first time posting here, so I hope I've asked my question in the right sort of way.

After adding an element to a Python dictionary, is it possible to get Python to tell me whether adding that element caused a hash collision, so that I can count how many collisions occur?

3 Answers
  •  盖世英雄少女心
    2020-12-28 19:43

    Short answer:

    You can't simulate using object ids as dict keys by using random integers as dict keys. They have different hash functions.
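    A quick illustration of that difference (a sketch; the exact behavior described is CPython's):

```python
# Small ints hash to themselves, so random integers spread over the table
# quite differently from object addresses rotated by 4 bits.
print(hash(12345) == 12345)   # True: small-int hashing is the value itself

class Foo:
    pass

o = Foo()
print(hash(o) == id(o))       # False on CPython >= 2.7: the address is rotated
```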

    Collisions do happen. "Having unique thingies means no collisions" is wrong for several values of "thingy".

    You shouldn't be worrying about collisions.

    Long answer:

    Some explanations, derived from reading the source code:

    A dict is implemented as a table of 2 ** i entries, where i is an integer.

    dicts are kept no more than 2/3 full. Consequently, for 15000 keys, i must be 15 and 2 ** i is 32768 (a table of 2 ** 14 entries could hold at most 10922 keys at 2/3 full).

    When o is an arbitrary instance of a class that doesn't define __hash__(), it is NOT true that hash(o) == id(o). Because the address is likely to have zeroes in its low-order 3 or 4 bits, the hash is constructed by rotating the address right by 4 bits; see the source file Objects/object.c, function _Py_HashPointer.
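    That rotation can be reproduced in pure Python (a sketch of _Py_HashPointer, assuming a CPython 3 build where the pointer and hash widths match, as they do on common platforms):

```python
import sys

# Sketch of CPython's _Py_HashPointer (Objects/object.c): the default hash of
# a plain object rotates its address right by 4 bits, so the alignment zeroes
# don't all end up in the low-order bits of the hash.
BITS = sys.maxsize.bit_length() + 1      # pointer width in bits (64 on most builds)
MASK = (1 << BITS) - 1

def pointer_hash(addr):
    rotated = ((addr >> 4) | (addr << (BITS - 4))) & MASK
    # Reinterpret the unsigned rotation as a signed Py_hash_t.
    return rotated - (1 << BITS) if rotated >> (BITS - 1) else rotated

class Foo:
    pass

o = Foo()
print(hash(o) == pointer_hash(id(o)))    # True on CPython (barring the rare -1 fixup)
```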

    It would be a problem if there were lots of zeroes in the low-order bits, because to access a table of size 2 ** i (e.g. 32768), the hash value (often much larger than that) must be crunched to fit, and this is done very simply and quickly by taking the low order i (e.g. 15) bits of the hash value.
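    For example (the key is hypothetical, just to illustrate the masking):

```python
# Crunching a hash value into a slot index for a table of 2 ** i entries:
# keep only the low-order i bits with a bitwise AND.
i = 15
mask = 2 ** i - 1            # 0x7fff, i.e. 32767
h = hash("some key")         # typically a much larger (possibly negative) number
slot = h & mask              # first probe position: always in range(2 ** i)
print(0 <= slot < 2 ** i)    # True, even when h is negative
```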

    Consequently collisions are inevitable.

    However this is not cause for panic. The remaining bits of the hash value are factored into the calculation of where the next probe will be. The likelihood of a 3rd etc probe being needed should be rather small, especially as the dict is never more than 2/3 full. The cost of multiple probes is mitigated by the cheap cost of calculating the slot for the first and subsequent probes.
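    A sketch of that probing scheme (modelled on Objects/dictobject.c; the constant is CPython's, but the function itself is illustrative, not the interpreter's code):

```python
PERTURB_SHIFT = 5  # CPython's constant

def probe_sequence(h, mask, count):
    # Model of CPython's collision resolution: the first slot comes from the
    # low bits of the hash; each later probe mixes in the remaining high bits
    # via `perturb`, so hashes that collide on the first slot soon diverge.
    perturb = h & ((1 << 64) - 1)   # treat the hash as unsigned
    j = perturb & mask
    slots = []
    for _ in range(count):
        slots.append(j)
        perturb >>= PERTURB_SHIFT
        j = (5 * j + 1 + perturb) & mask
    return slots

print(probe_sequence(12345, 7, 5))  # five probe positions, all in range(8)
```

    Once perturb is exhausted, the recurrence j = 5*j + 1 (mod 2 ** i) cycles through every slot, so a probe sequence is guaranteed to find an empty entry eventually.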

    The code below is a simple experiment illustrating most of the above discussion. It presumes random accesses of the dict after it has reached its maximum size. With Python2.7.1, it shows about 2000 collisions for 15000 objects (13.3%).

    In any case the bottom line is that you should really divert your attention elsewhere. Collisions are not your problem unless you have achieved some extremely abnormal way of getting memory for your objects. You should look at how you are using the dicts, e.g. use k in d or try/except, not d.has_key(k). Consider one dict accessed as d[(x, y)] instead of two levels accessed as d[x][y]. If you need help with that, ask a separate question.
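    For instance, the flat tuple-keyed layout looks like this (toy names, just to show the idioms):

```python
# One dict keyed by (x, y) tuples instead of a dict of dicts.
d = {}
d[(3, 4)] = "cell value"

print((3, 4) in d)        # membership test: use `k in d`, not d.has_key(k)
print(d.get((9, 9)))      # None: no try/except or nested lookups needed
```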

    Update after testing on Python 2.6:

    Rotating the address was not introduced until Python 2.7; see this bug report for comprehensive discussion and benchmarks. The basic conclusions are IMHO still valid, and can be augmented by "Upgrade if you can".

    >>> n = 15000
    >>> i = 0
    >>> while 2 ** i / 1.5 < n:
    ...    i += 1
    ...
    >>> print i, 2 ** i, int(2 ** i / 1.5)
    15 32768 21845
    >>> probe_mask = 2 ** i - 1
    >>> print hex(probe_mask)
    0x7fff
    >>> class Foo(object):
    ...     pass
    ...
    >>> olist = [Foo() for j in xrange(n)]
    >>> hashes = [hash(o) for o in olist]
    >>> print len(set(hashes))
    15000
    >>> probes = [h & probe_mask for h in hashes]
    >>> print len(set(probes))
    12997
    >>>
    
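    For anyone on Python 3, the same experiment can be sketched like this (same logic, modern syntax; the exact collision count will vary from run to run):

```python
# Python 3 port of the session above: hash 15000 plain instances and count
# how many distinct slots they occupy in a table of 2 ** 15 entries.
n = 15000
i = 0
while 2 ** i / 1.5 < n:              # dicts stay no more than ~2/3 full
    i += 1
probe_mask = 2 ** i - 1

class Foo:
    pass

olist = [Foo() for _ in range(n)]
hashes = [hash(o) for o in olist]    # all distinct: the rotation is a bijection
probes = [h & probe_mask for h in hashes]
print(i, hex(probe_mask), len(set(hashes)), len(set(probes)))
```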
