I'm looking for an explanation of how a hash table works - in plain English for a simpleton like me!
For example, I know it takes the key, calculates the hash (I am looking for an explanation how) and then performs some kind of modulo to work out where it lies in the array that the value is stored, but that's where my knowledge stops.
Lots of answers, but none of them are very visual, and hash tables can easily "click" when visualised.
Hash tables are often implemented as arrays of linked lists. If we imagine a table storing people's names, after a few insertions it might be laid out in memory as below, where ()-enclosed numbers are hash values of the text/name.
bucket# bucket content / linked list
[0] --> "sue"(780) --> null
[1] null
[2] --> "fred"(42) --> "bill"(9282) --> "jane"(42) --> null
[3] --> "mary"(73) --> null
[4] null
[5] --> "masayuki"(75) --> "sarwar"(105) --> null
[6] --> "margaret"(2626) --> null
[7] null
[8] --> "bob"(308) --> null
[9] null
A few points:
- each entry in the array of buckets (aka [0], [1]...) is known as a bucket, and starts a - possibly empty - linked list of values (aka elements, in this example - people's names)
- each value (e.g. "fred" with hash 42) is linked from bucket [hash % number_of_buckets], e.g. 42 % 10 == [2]; % is the modulo operator - the remainder when divided by the number of buckets (see the sketch just after these points)
- several values may end up chained from the same bucket, most often because their hash values collide after the modulo operation (e.g. 42 % 10 == [2], and 9282 % 10 == [2]), but occasionally because the hash values themselves are the same (e.g. "fred" and "jane" both shown with hash 42 above)
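To make the bucket arithmetic and collision chaining concrete, here's a minimal C++ sketch of a chained hash set of names. The NameSet name, the fixed 10 buckets and the use of std::hash are just illustrative choices mirroring the diagram, not how any particular library does it:

#include <string>
#include <vector>
#include <forward_list>
#include <functional>

// illustrative separate-chaining hash set: an array of 10 buckets, each a linked list of names
struct NameSet {
    std::vector<std::forward_list<std::string>> buckets;
    NameSet() : buckets(10) {}                        // 10 buckets, as in the diagram above

    size_t bucket_for(const std::string& name) const {
        return std::hash<std::string>{}(name) % buckets.size();   // hash the key, then modulo
    }

    void insert(const std::string& name) {
        auto& chain = buckets[bucket_for(name)];
        for (const auto& existing : chain)            // walk the chain: collisions share a bucket
            if (existing == name) return;             // full-key comparison; already stored
        chain.push_front(name);
    }

    bool contains(const std::string& name) const {
        for (const auto& existing : buckets[bucket_for(name)])
            if (existing == name) return true;
        return false;
    }
};

With the made-up hash values from the diagram, "fred", "bill" and "jane" would all chain from bucket [2], exactly as shown above.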
If the table size grows, hash tables implemented as above tend to resize themselves (i.e. create a bigger array of buckets, create new/updated linked lists there-from, delete the old array) to keep the ratio of values to buckets (aka load factor) somewhere in the 0.5 to 1.0 range.
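The resizing step can be sketched in the same spirit - a hypothetical grow() that doubles the bucket array and re-chains every stored value (real libraries pick growth policies and bucket counts more carefully):

#include <string>
#include <vector>
#include <forward_list>
#include <functional>
#include <utility>

// sketch of a resize: build a bigger bucket array, re-link every value, discard the old array
void grow(std::vector<std::forward_list<std::string>>& buckets) {
    std::vector<std::forward_list<std::string>> bigger(buckets.size() * 2);
    for (auto& chain : buckets)
        for (auto& name : chain) {
            size_t b = std::hash<std::string>{}(name) % bigger.size();  // re-bucket with the new count
            bigger[b].push_front(std::move(name));
        }
    buckets.swap(bigger);    // 'bigger' now holds the old array and is destroyed on return
}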
Hans gives the actual formula for other load factors in a comment below, but for indicative values: with load factor 1 and a cryptographic strength hash function, 1/e (~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) have one element, 1/(2e) or ~18.4% two elements, 1/(3!e) about 6.1% three elements, 1/(4!e) or ~1.5% four elements, 1/(5!e) ~.3% have five etc.. - the average chain length from non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant time operations.
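Those percentages come from the Poisson distribution: with load factor lambda, a fraction e^-lambda * lambda^k / k! of buckets ends up holding k elements. A tiny program to reproduce the figures quoted above for load factor 1:

#include <cmath>
#include <cstdio>

// fraction of buckets holding exactly k elements, for a given load factor (Poisson distribution)
double bucket_fraction(double load_factor, int k) {
    return std::exp(-load_factor) * std::pow(load_factor, k) / std::tgamma(k + 1);  // tgamma(k+1) == k!
}

int main() {
    const double load_factor = 1.0;                   // as in the figures quoted above
    for (int k = 0; k <= 5; ++k)
        std::printf("%d element(s): %5.1f%%\n", k, 100 * bucket_fraction(load_factor, k));
    // average chain length over non-empty buckets: load_factor / (1 - e^-load_factor), ~1.58 here
    std::printf("average non-empty chain: %.2f\n", load_factor / (1 - std::exp(-load_factor)));
}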
When keys have associated data - say we also stored each person's age and looked it up by name - we're using the hash table as an associative container aka map, and the values it stores can be deemed to consist of a key (the name) and one or more other fields still termed - confusingly - the value (in this example, just the age). A hash table implementation used as a map is known as a hash map.
This contrasts with the example earlier in this answer where we stored discrete values like "sue", which you could think of as being its own key: that kind of usage is known as a hash set.
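In C++, for instance, this split is visible in the standard library's hash table containers - std::unordered_set for the hash set usage and std::unordered_map for the hash map usage (the names and ages below are just made up for illustration):

#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>

int main() {
    // hash set: each stored value is its own key
    std::unordered_set<std::string> names{"sue", "fred", "bill"};
    std::cout << (names.count("fred") ? "fred is present\n" : "no fred\n");

    // hash map: a key (the name) with an associated value (the age)
    std::unordered_map<std::string, int> ages{{"sue", 31}, {"fred", 49}};
    ages["jane"] = 62;                                // insert or update by key
    std::cout << "fred is " << ages["fred"] << '\n';
}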
Not all hash tables use linked lists (known as separate chaining), but most general purpose ones do, as the main alternative closed hashing (aka open addressing) - particularly with erase operations supported - has less stable performance properties with collision-prone keys/hash functions.
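For contrast, here's a bare-bones sketch of the closed hashing / open addressing idea: linear probing, fixed capacity, and no erase support - erase being precisely where open addressing gets fiddly, needing tombstones or re-insertion. The ProbedNameSet name and the 16 slots are purely illustrative:

#include <optional>
#include <string>
#include <vector>
#include <functional>

// open addressing sketch: values live directly in the slot array; collisions probe forward
struct ProbedNameSet {
    std::vector<std::optional<std::string>> slots;
    ProbedNameSet() : slots(16) {}                    // fixed capacity; a real table would resize

    void insert(const std::string& name) {
        size_t i = std::hash<std::string>{}(name) % slots.size();
        while (slots[i] && *slots[i] != name)         // occupied by a different key: try next slot
            i = (i + 1) % slots.size();               // (assumes the table never completely fills)
        slots[i] = name;
    }

    bool contains(const std::string& name) const {
        size_t i = std::hash<std::string>{}(name) % slots.size();
        while (slots[i]) {                            // an empty slot ends the probe sequence
            if (*slots[i] == name) return true;
            i = (i + 1) % slots.size();
        }
        return false;
    }
};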
A general purpose, worst-case collision-minimising hash function's job is to spray the keys around the hash table buckets effectively at random, while always generating the same hash value for the same key. Even one bit changing anywhere in the key would ideally - randomly - flip about half the bits in the resultant hash value.
This is normally orchestrated with maths too complicated for me to grok. I'll mention one easy-to-understand way - not the most scalable or cache friendly, but inherently elegant (like encryption with a one-time pad!) - as I think it helps drive home the desirable qualities mentioned above. Say you were hashing 64-bit doubles - you could create 8 tables each of 256 random numbers (code below), then use each 8-bit/1-byte slice of the double's memory representation to index into a different table, XORing the random numbers you look up. With this approach, it's easy to see that a bit (in the binary digit sense) changing anywhere in the double results in a different random number being looked up in one of the tables, and a totally uncorrelated final value.
// note caveats above: cache unfriendly (SLOW) but strong hashing...
size_t random[8][256] = { ...random data... };
const unsigned char* p = (const unsigned char*)&my_double;   // unsigned char so each byte indexes 0..255
size_t hash = random[0][p[0]] ^
              random[1][p[1]] ^
              ... ^                                          // tables [2] to [6], likewise
              random[7][p[7]];
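And for anyone who wants to try it, a self-contained version of the same idea with the random tables actually filled in. The TabulationHash name and the fixed seed are just for this sketch; a real implementation would choose its seeding and table storage more carefully:

#include <cstdint>
#include <cstring>
#include <random>

// tabulation-style hashing sketch: 8 tables of 256 random words, one table per byte of the double
struct TabulationHash {
    std::uint64_t table[8][256];

    TabulationHash() {
        std::mt19937_64 gen(12345);                   // fixed seed: the same key must always hash the same
        for (auto& t : table)
            for (auto& entry : t)
                entry = gen();
    }

    std::uint64_t operator()(double d) const {
        unsigned char bytes[8];
        std::memcpy(bytes, &d, sizeof bytes);         // read the double's memory representation
        std::uint64_t h = 0;
        for (int i = 0; i < 8; ++i)
            h ^= table[i][bytes[i]];                  // each byte picks a random word; XOR them together
        return h;
    }
};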
Many libraries' hashing functions pass integers through unchanged (known as a trivial or identity hash function); it's the other extreme from the strong hashing described above. An identity hash is extremely collision prone in the worst cases, but the hope is that in the fairly common case of integer keys that tend to be incrementing (perhaps with some gaps), they'll map into successive buckets, leaving fewer empty than random hashing does (our ~36.8% at load factor 1, mentioned earlier) and so producing fewer and shorter linked lists of colliding elements than random mappings achieve. It also saves the time it takes to generate a strong hash, and if keys are looked up in order they'll be found in buckets nearby in memory, improving cache hits. When the keys don't increment nicely, the hope is they'll be random enough that they won't need a strong hash function to randomise their placement into buckets.
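A quick way to see the incrementing-keys effect: with a trivial hash (the key is its own hash value) and, say, 8 buckets, consecutive keys land in consecutive buckets with no collisions at all:

#include <cstdio>

int main() {
    const int num_buckets = 8;                        // illustrative bucket count
    // trivial/identity hash: the integer key is used directly as its own hash value
    for (int key = 100; key < 108; ++key)
        std::printf("key %d -> bucket %d\n", key, key % num_buckets);
}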