C++ - Why is boost::hash_combine the best way to combine hash-values?

臣服心动 2020-11-30 22:41

I've read in other posts that this seems to be the best way to combine hash-values. Could somebody please break this down and explain why this is the best way to do it?

3 Answers
  •  孤独总比滥情好
    2020-11-30 22:51

    It's not the best; surprisingly to me, it's not even particularly good. The main problem is the poor distribution, which is not really the fault of boost::hash_combine in itself, but arises in conjunction with a badly distributing hash like std::hash, which for integers is most commonly implemented as the identity function.

    Figure 2: The effect of a single bit change in one of two random 32 bit numbers on the result of boost::hash_combine

    To demonstrate how bad things can become, these are the collisions for points on a 32x32 grid when using hash_combine as intended, together with std::hash:

    # hash      x₀   y₀  x₁  y₁ ...
    3449074105  6   30   8  15
    3449074104  6   31   8  16
    3449074107  6   28   8  17
    3449074106  6   29   8  18
    3449074109  6   26   8  19
    3449074108  6   27   8  20
    3449074111  6   24   8  21
    3449074110  6   25   8  22
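
The first rows of the table can be reproduced with a minimal sketch of a boost-style combiner. It assumes the classic formula `seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2)` applied to a 32-bit seed that starts at 0, and it feeds the coordinates in unchanged as a stand-in for an identity std::hash. Whether this matches the exact Boost version measured above is an assumption, and the names `boost_style_combine` and `hash_point` are made up for illustration.

```cpp
#include <cstdint>

// Boost-style combine step on a 32-bit seed (assumed classic formula).
constexpr std::uint32_t boost_style_combine(std::uint32_t seed, std::uint32_t h) {
    return seed ^ (h + 0x9e3779b9u + (seed << 6) + (seed >> 2));
}

// Hash a grid point by combining x, then y, starting from seed 0.
// The coordinates are used as their own hash (identity stand-in for std::hash).
constexpr std::uint32_t hash_point(std::uint32_t x, std::uint32_t y) {
    return boost_style_combine(boost_style_combine(0u, x), y);
}

// Two different points from the first table row land on the same value.
static_assert(hash_point(6, 30) == hash_point(8, 15), "expected a collision");
static_assert(hash_point(6, 30) == 3449074105u, "matches the first table row");
```

Because the low bits of small coordinates pass through almost unchanged, neighbouring points collide in regular patterns like the ones listed.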
    

    For a well distributed hash there should be none, statistically. Using bit-rotations instead of bit-shifts and xor instead of addition one could easily create a similar hash_combine that preserves entropy better. But really what you should do is use a good hash function in the first place, then after that a simple xor is sufficient to combine the seed and the hash.

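    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <limits>
    #include <type_traits>
    
    // xor the value with a right-shifted copy of itself
    template <typename T>
    T xorshift(const T& n, int i) {
      return n ^ (n >> i);
    }
    
    // 32-bit mixer, named distribute to avoid confusion with std::hash
    uint32_t distribute(const uint32_t& n) {
      uint32_t p = 0x55555555ul; // pattern of alternating 0 and 1
      uint32_t c = 3423571495ul; // random odd integer constant
      return c * xorshift(p * xorshift(n, 16), 16);
    }
    
    // 64-bit overload of the same mixer, so that a 64-bit hash is not truncated
    uint64_t distribute(const uint64_t& n) {
      uint64_t p = 0x5555555555555555ull;   // pattern of alternating 0 and 1
      uint64_t c = 17316035218449499591ull; // random odd integer constant
      return c * xorshift(p * xorshift(n, 32), 32);
    }
    
    // if c++20 rotl is not available:
    template <typename T, typename S>
    typename std::enable_if<std::is_unsigned<T>::value, T>::type
    constexpr rotl(const T n, const S i) {
      const T m = (std::numeric_limits<T>::digits - 1);
      const T c = i & m;
      return (n << c) | (n >> ((T(0) - c) & m)); // this is usually recognized by the compiler to mean rotation, also c++20 now gives us rotl directly
    }
    
    // call with the old seed and the next key; returns the new seed (the seed argument itself is not modified)
    template <class T>
    inline std::size_t hash_combine(std::size_t& seed, const T& v)
    {
        return rotl(seed, std::numeric_limits<std::size_t>::digits / 3) ^ distribute(std::hash<T>{}(v));
    }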
    

    The seed is rotated once before combining it to make the order in which the hash was computed relevant.
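
A small sketch of why the rotation matters: xor alone is commutative, so combining two hashes in either order gives the same result, while rotating the seed first breaks that symmetry. The helper names (`rotl64`, `xor_only`, `rotate_then_xor`) are illustrative only, the small inputs stand in for already distributed hash values, and the rotation amount 21 assumes a 64-bit seed (64/3, as in the code above).

```cpp
#include <cstdint>

// Rotate a 64-bit value left by r bits, for 0 < r < 64.
constexpr std::uint64_t rotl64(std::uint64_t n, unsigned r) {
    return (n << r) | (n >> (64u - r));
}

// Combine by xor only: the order of the inputs cannot matter.
constexpr std::uint64_t xor_only(std::uint64_t seed, std::uint64_t h) {
    return seed ^ h;
}

// Rotate the seed first, then xor: swapping the inputs changes the result.
constexpr std::uint64_t rotate_then_xor(std::uint64_t seed, std::uint64_t h) {
    return rotl64(seed, 21) ^ h;
}

// Combine the values 1 and 2 in both orders, starting from seed 0.
static_assert(xor_only(xor_only(0, 1), 2) == xor_only(xor_only(0, 2), 1),
              "xor alone ignores order");
static_assert(rotate_then_xor(rotate_then_xor(0, 1), 2) !=
              rotate_then_xor(rotate_then_xor(0, 2), 1),
              "rotation makes order matter");
```

With xor only, the pairs (x, y) and (y, x) would always collide, which is exactly what a hash for ordered tuples should avoid.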

    The hash_combine from boost needs two fewer operations and, more importantly, no multiplications; in fact it's about 5x faster. But at about 2 cycles per hash on my machine, the proposed solution is still very fast and pays off quickly when used for a hash table. There are 118 collisions on a 1024x1024 grid (vs. 982017 for boost's hash_combine + std::hash), about as many as expected for a well distributed hash function, and that is all we can ask for.

    Now, even when used in conjunction with a good hash function, boost::hash_combine is not ideal. If all entropy is in the seed at some point, some of it will get lost. There are 2948667289 distinct results of boost::hash_combine(x,0), but there should be 4294967296.

    In conclusion, they tried to create a hash function that does both combining and cascading, and does so fast, but ended up with something that does both just well enough not to be recognised as bad immediately.
