How does a hash table work?

盖世英雄少女心 2020-11-22 09:32

I'm looking for an explanation of how a hash table works - in plain English for a simpleton like me!

For example, I know it takes the key, calculates the hash (I a…

15 Answers
  •  独厮守ぢ
    2020-11-22 10:05

    Lots of answers, but none of them are very visual, and hash tables can easily "click" when visualised.

    Hash tables are often implemented as arrays of linked lists. If we imagine a table storing people's names, after a few insertions it might be laid out in memory as below, where ()-enclosed numbers are hash values of the text/name.

    bucket#  bucket content / linked list
    
    [0]      --> "sue"(780) --> null
    [1]      null
    [2]      --> "fred"(42) --> "bill"(9282) --> "jane"(42) --> null
    [3]      --> "mary"(73) --> null
    [4]      null
    [5]      --> "masayuki"(75) --> "sarwar"(105) --> null
    [6]      --> "margaret"(2626) --> null
    [7]      null
    [8]      --> "bob"(308) --> null
    [9]      null
    

    A few points:

    • each of the array entries (indices [0], [1]...) is known as a bucket, and starts a - possibly empty - linked list of values (aka elements, in this example - people's names)
    • each value (e.g. "fred" with hash 42) is linked from bucket [hash % number_of_buckets] e.g. 42 % 10 == [2]; % is the modulo operator - the remainder when divided by the number of buckets
    • multiple data values may collide at and be linked from the same bucket, most often because their hash values collide after the modulo operation (e.g. 42 % 10 == [2], and 9282 % 10 == [2]), but occasionally because the hash values are the same (e.g. "fred" and "jane" both shown with hash 42 above)
      • most hash tables handle collisions - with slightly reduced performance but no functional confusion - by comparing the full value (here the text) being sought or inserted against each value already in the linked list at the hashed-to bucket (a small chaining sketch follows this list)
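
    To make the bucket selection and chain walk concrete, here is a minimal sketch of the layout above - it assumes std::hash as the hash function and a fixed count of 10 buckets, and the insert/contains lambda names are purely illustrative:

    // assumes: std::hash<std::string> supplies the hash values, 10 fixed buckets
    #include <forward_list>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::forward_list<std::string>> buckets(10);

        auto insert = [&](const std::string& name) {
            std::size_t b = std::hash<std::string>{}(name) % buckets.size(); // pick the bucket
            buckets[b].push_front(name);                                     // chain on collision
        };
        auto contains = [&](const std::string& name) {
            std::size_t b = std::hash<std::string>{}(name) % buckets.size();
            for (const auto& stored : buckets[b])   // walk the chain, comparing full values
                if (stored == name) return true;
            return false;
        };

        for (const char* n : {"sue", "fred", "bill", "jane", "mary"}) insert(n);
        std::cout << std::boolalpha << contains("fred") << ' ' << contains("bob") << '\n';
    }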

    Linked list lengths relate to load factor, not the number of values

    As the number of stored values grows, hash tables implemented as above tend to resize themselves (i.e. create a bigger array of buckets, re-link the values into new linked lists there, then delete the old array) to keep the ratio of values to buckets (aka the load factor) somewhere in the 0.5 to 1.0 range.
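
    By way of illustration, a minimal rehashing sketch - it assumes the table is stored as a std::vector of std::forward_list chains with std::hash supplying the hash values (assumptions of this sketch, not requirements):

    #include <forward_list>
    #include <functional>
    #include <string>
    #include <vector>

    void rehash(std::vector<std::forward_list<std::string>>& buckets,
                std::size_t new_bucket_count) {
        std::vector<std::forward_list<std::string>> bigger(new_bucket_count);
        for (auto& chain : buckets)                 // walk every old chain...
            for (auto& name : chain)                // ...and re-link each value
                bigger[std::hash<std::string>{}(name) % new_bucket_count]
                    .push_front(std::move(name));
        buckets.swap(bigger);                       // old array is discarded here
    }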

    Hans gives the actual formula for other load factors in a comment below, but for indicative values: with load factor 1 and a cryptographic-strength hash function, 1/e (~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) will have one element, 1/(2e) or ~18.4% two elements, 1/(3!e) about 6.1% three elements, 1/(4!e) or ~1.5% four elements, 1/(5!e) or ~0.3% five elements, and so on. The average chain length for non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant-time operations.
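
    Those percentages follow from the Poisson distribution with mean 1 - at load factor 1, the fraction of buckets holding exactly k values is 1/(k!e) - and a few lines suffice to reproduce them:

    #include <cmath>
    #include <cstdio>

    int main() {
        // fraction of buckets holding exactly k values at load factor 1: 1/(k!e)
        double factorial = 1.0;
        for (int k = 0; k <= 5; ++k) {
            if (k > 0) factorial *= k;
            std::printf("%d value(s): %4.1f%% of buckets\n", k, 100.0 * std::exp(-1.0) / factorial);
        }
    }
    // prints ~36.8, 36.8, 18.4, 6.1, 1.5, 0.3 - matching the figures above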

    How a hash table can associate keys with values

    Given a hash table implementation as described above, we can imagine creating a value type such as `struct Value { string name; int age; };`, plus equality-comparison and hash functions that only look at the `name` field (ignoring age). Then something wonderful happens: we can store `Value` records like `{"sue", 63}` in the table, then later search for "sue" without knowing her age, find the stored value and recover or even update her age - happy birthday Sue - which, interestingly, doesn't change the hash value so doesn't require that we move Sue's record to another bucket.
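
    A minimal sketch of that idea, using std::unordered_set (rather than the hand-rolled table above) with hash and equality functors that look only at name - the struct and functor names here are just for illustration:

    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_set>

    struct Value {
        std::string name;
        mutable int age;    // mutable: age may change without affecting the key/hash
    };

    struct NameHash {
        std::size_t operator()(const Value& v) const { return std::hash<std::string>{}(v.name); }
    };
    struct NameEq {
        bool operator()(const Value& a, const Value& b) const { return a.name == b.name; }
    };

    int main() {
        std::unordered_set<Value, NameHash, NameEq> people;
        people.insert({"sue", 63});

        auto it = people.find({"sue", 0});   // the age in the probe object is ignored
        if (it != people.end()) {
            it->age += 1;                    // happy birthday Sue - same bucket, no rehash
            std::cout << it->name << " is now " << it->age << '\n';
        }
    }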

    When we do this, we're using the hash table as an associative container aka map, and the values it stores can be deemed to consist of a key (the name) and one or more other fields still termed - confusingly - the value (in my example, just the age). A hash table implementation used as a map is known as a hash map.

    This contrasts with the example earlier in this answer where we stored discrete values like "sue", which you could think of as being its own key: that kind of usage is known as a hash set.

    There are other ways to implement a hash table

    Not all hash tables use linked lists (an approach known as separate chaining), but most general-purpose ones do, as the main alternative - closed hashing (aka open addressing) - has less stable performance properties with collision-prone keys/hash functions, particularly when erase operations must be supported.
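
    For contrast, a bare-bones open addressing (linear probing) sketch - the fixed capacity and function names are assumptions for brevity, and erase is omitted because it would need extra "tombstone" markers:

    #include <array>
    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <string>

    constexpr std::size_t kCapacity = 16;
    std::array<std::optional<std::string>, kCapacity> slots;   // empty optional == empty slot

    bool insert(const std::string& name) {
        std::size_t i = std::hash<std::string>{}(name) % kCapacity;
        for (std::size_t probes = 0; probes < kCapacity; ++probes) {
            if (!slots[i] || *slots[i] == name) { slots[i] = name; return true; }
            i = (i + 1) % kCapacity;           // collision: try the next slot along
        }
        return false;                          // table full
    }

    bool contains(const std::string& name) {
        std::size_t i = std::hash<std::string>{}(name) % kCapacity;
        for (std::size_t probes = 0; probes < kCapacity; ++probes) {
            if (!slots[i]) return false;       // hit an empty slot: name isn't stored
            if (*slots[i] == name) return true;
            i = (i + 1) % kCapacity;
        }
        return false;
    }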


    A few words on hash functions

    Strong hashing...

    A general purpose, worst-case collision-minimising hash function's job is to spray the keys around the hash table buckets effectively at random, while always generating the same hash value for the same key. Even one bit changing anywhere in the key would ideally - randomly - flip about half the bits in the resultant hash value.

    This is normally orchestrated with maths too complicated for me to grok. I'll mention one easy-to-understand way - not the most scalable or cache friendly but inherently elegant (like encryption with a one-time pad!) - as I think it helps drive home the desirable qualities mentioned above. Say you were hashing 64-bit doubles - you could create 8 tables each of 256 random numbers (code below), then use each 8-bit/1-byte slice of the double's memory representation to index into a different table, XORing the random numbers you look up. With this approach, it's easy to see that a bit (in the binary digit sense) changing anywhere in the double results in a different random number being looked up in one of the tables, and a totally uncorrelated final value.

    // note caveats above: cache unfriendly (SLOW) but strong hashing...
    // (unsigned char casts keep the table indices in 0..255 even where char is signed)
    size_t random[8][256] = { ...random data... };
    const char* p = (const char*)&my_double;
    size_t hash = random[0][(unsigned char)p[0]] ^
                  random[1][(unsigned char)p[1]] ^
                  ... ^
                  random[7][(unsigned char)p[7]];
    

    Weak but oft-fast hashing...

    Many libraries' hashing functions pass integers through unchanged (known as a trivial or identity hash function); it's the other extreme from the strong hashing described above. An identity hash is extremely collision prone in the worst cases, but the hope is that in the fairly common case of integer keys that tend to be incrementing (perhaps with some gaps), they'll map into successive buckets, leaving fewer empty than random hashing does (our ~36.8% at load factor 1 mentioned earlier), and so produce fewer collisions and shorter linked lists of colliding elements than random mapping achieves. It's also great to save the time it takes to generate a strong hash, and if keys are looked up in order they'll be found in buckets nearby in memory, improving cache hits. When the keys don't increment nicely, the hope is they'll be random enough that they won't need a strong hash function to totally randomise their placement into buckets.
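
    As a tiny illustration of such a trivial hash (close to what some standard library implementations use for std::hash over integers):

    #include <cstddef>

    struct IdentityHash {
        std::size_t operator()(int key) const {
            return static_cast<std::size_t>(key);   // pass the key through unchanged
        }
    };
    // keys 100, 101, 102... land in consecutive buckets after the modulo,
    // which keeps them dense and cache friendly when looked up in order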
