Linear probing huge sequences of keys with unequal hash

问题

There is one thing about linear probing (hash tables) that is not intuitive to me. If I put key1 which hash results to array index 1. Then I put key2 -> array index 2. Then I put key3 -> again array index 1, this will go to array index 3. Then when I search for key3 I should go through indexes that contain keys that do not have the same hash as mine at all. Isn't this waste? If the sequence is really big and contains many keys (for example I have 20 elements, then null, for any key that results in array index from 0 to 20 I must go through all the indexes although they do not have the same hash as mine and I can eliminate this with separate chaining).

Or this is mitigated by the fact that our hashing function (if written well enough) distributes the keys equally among the indices and we constantly resize the array to be max half full?

回答1:

Linear probing is sub-optimal when there are many collisions. Note that the number of collisions doesn't only depend on the hash, but also on the number of slots in the table (usually a prime number) because the index is the remainder of the integer division of the hash by the table length.

Note however, than having the colliding keys one next to the other might also take advantage of the CPU caches, which will bring from RAM many elements in one read. So, do not (in principle) think that the time it takes to check 20 probes is 20 times the time it takes to check one, because what happens inside the CPU and its caches is much faster than going to RAM. There is no magic though. If the computation of every comparison throws away what's in the caches, then part of the savings will be lost.

回答2:

The issue you've identified is indeed something that can impact the performance of linear probing. When you try to look up some element, you may have to look pretty far from where the initial hash probe started in order to find your element.

That being said, linear probing is extremely fast in practice, and that's mostly due to locality of reference. The cost of looking up something in memory is not uniform - if you look up an address near something you've already read recently, chances are that memory region has been pulled into cache and the cost of looking things up is extremely low. As a result, the cost of those probes in practice is often less than you might naturally expect, since those probes are probably pretty quick.

However, this doesn't mean you can ignore this fact. There are a number of issues to keep an eye out for. First, as the load factor of the table increases, there comes a point where the cost of hitting other elements starts to make lookups take progressively longer and longer. Usually, you'll see people rehash into a bigger table at around a load factor of 75%. Second, you need to have a pretty good hash function, since if you have a low-quality hash that drops lots of elements into similar locations, you'll get really terrible performance for the reason you've mentioned.

There are a couple of techniques you can use to mitigate this. Robin Hood hashing works by moving around elements after they've been placed down so that elements that are closer to home get pushed further away to make room for elements that are closer to home. This makes the average cost of a lookup a bit higher, but dramatically reduces the worst case cost of a lookup (in other words, it reduces the variance of a lookup cost in exchange for increasing the expected value of that lookup cost). Hopscotch hashing works by capping the maximum distance that elements can be displaced and maintaining a bitmask indicating which elements nearby might be a match, reducing the amount of work that you need to do to find things. And the new Google flat_map starts with linear probing and uses really clever hashing and parallel memory operations to make lookups extremely fast.

来源：https://stackoverflow.com/questions/53797262/linear-probing-huge-sequences-of-keys-with-unequal-hash

标签

algorithm

performance

hash

hashtable

linear-probing