C++ unordered_map collision handling, resize and rehash

旧时难觅i 2021-02-13 04:10

I have not read the C++ Standard, but this is how I think the C++ unordered_map is supposed to work.

  • Allocate a memory block in the heap.
  • With every p
3 answers
  •  孤街浪徒
    2021-02-13 04:29

    Allocate a memory block in the heap.

    True - there's a block of memory for an array of "buckets", which in the case of GCC are actually iterators capable of recording a place in a forward-linked list.

    With every put request, hash the object and map it to a space in this memory

    No... when you insert/emplace further items into the container, an additional dynamic (i.e. heap) allocation is done with space for the node's next link and the value being inserted/emplaced. The linked list is rewired accordingly, so the newly inserted element is linked to and/or from the other elements that hashed to the same bucket, and if other buckets also have elements, that group will be linked to and/or from the nodes for those elements.

    At some point, the hash table content might look like this (GCC does things this way, but it's possible to do something simpler):

               +------->  head
              /            |
    bucket#  /            #503
    [0]----\/              |
    [1]    /\      /===> #1003
    [2]===/==\====/        |
    [3]--/    \     /==>  #22
    [4]        \   /       |
    [5]         \ /        #7
    [6]          \         |
    [7]=========/ \-----> #177
    [8]                    |
    [9]                   #100
                       
    
    • The buckets on the left are the array from the original allocation: there are 10 elements in the illustrated array, so "bucket_count()" == 10.

    • A key with hash value X (written #X, e.g. #177) hashes to bucket X % bucket_count(). That bucket needs to store an iterator to the singly-linked list element immediately before the first element hashing to that bucket, so that when the last element is erased from the bucket, either head or the next pointer of a node in another bucket can be rewired to skip over the erased element.

    • While elements in a bucket need to be contiguous in the forward-linked list, the ordering of buckets within that list is an unimportant consequence of the order of insertion of elements in the container, and isn't stipulated in the Standard.
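The bucket layout described above can be observed through the container's public bucket interface. The following is a minimal sketch: `keys_sharing_bucket` is a hypothetical helper name, and it relies only on the standard members `bucket()`, `bucket_size()`, and the local iterators `begin(n)`/`end(n)`. Which keys end up sharing a bucket depends on your implementation's hash function and bucket count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: returns every key stored in the same bucket as `key`.
// Which keys share a bucket is implementation-specific.
template <class Map>
std::vector<typename Map::key_type>
keys_sharing_bucket(const Map& m, const typename Map::key_type& key) {
    std::vector<typename Map::key_type> out;
    if (m.bucket_count() == 0) return out;  // bucket() requires at least one bucket

    const std::size_t b = m.bucket(key);    // index of the bucket `key` maps to
    // Local iterators walk only the elements belonging to bucket b.
    for (auto it = m.begin(b); it != m.end(b); ++it) {
        out.push_back(it->first);
    }
    assert(out.size() == m.bucket_size(b));
    return out;
}
```

For example, given a `std::unordered_map<int, std::string> m`, calling `keys_sharing_bucket(m, 177)` lists the keys that currently sit in the same chain as 177 would.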

    During this process, handle collisions via chaining or open addressing.

    The Standard library containers that are backed by hash tables always use separate chaining.
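One way to see separate chaining in action is to force every key to collide. The sketch below assumes a deliberately bad, hypothetical hash functor (`ConstantHash`) that returns 0 for every key; all elements then share one bucket, yet every insertion still succeeds because colliding elements are simply chained together.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical worst-case hash: every key gets the same hash value,
// so every key maps to the same bucket.
struct ConstantHash {
    std::size_t operator()(int) const noexcept { return 0; }
};

// Inserts keys 0..n-1 and returns the length of the chain they all share.
inline std::size_t chain_length_with_constant_hash(int n) {
    std::unordered_map<int, int, ConstantHash> m;
    for (int i = 0; i < n; ++i) {
        m.emplace(i, i);
    }
    assert(m.size() == static_cast<std::size_t>(n));  // no insertion was lost
    return m.empty() ? 0 : m.bucket_size(m.bucket(0));
}
```

With such a hash, lookups degrade to a linear walk along the single chain, which is why a well-distributed hash function matters far more than the bucket count alone.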

    I am quite surprised that I could not find much about how the memory is handled by unordered_map. Is there a specific initial size of memory which unordered_map allocates?

    No, the C++ Standard doesn't dictate what the initial memory allocation should be; it's up to the C++ implementation to choose. You can see how many buckets a newly created table has by printing out .bucket_count(), and in all likelihood if you multiply that by your pointer size you'll get the size of the heap allocation that the unordered container made: myUnorderedContainer.bucket_count() * sizeof(int*). That said, there's no prohibition on your Standard Library implementation varying the initial bucket_count() in arbitrary and bizarre ways (e.g. with optimisation level, or depending on the Key type), but I can't imagine why any would.
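A quick way to check what your own implementation does is to print the numbers. In this sketch, `estimated_bucket_array_bytes` and `print_initial_layout` are hypothetical helper names, and the byte figure rests on the assumption that each bucket occupies one pointer-sized slot, which the Standard does not guarantee. The printed values will differ between standard libraries.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <unordered_map>

// Rough estimate only: assumes one pointer-sized slot per bucket,
// which is an implementation detail and not guaranteed.
template <class Map>
std::size_t estimated_bucket_array_bytes(const Map& m) {
    return m.bucket_count() * sizeof(void*);
}

// Prints what this particular implementation chose (hypothetical helper).
inline void print_initial_layout() {
    std::unordered_map<int, int> empty_map;        // implementation picks the count
    std::unordered_map<int, int> hinted_map(50);   // request at least 50 buckets
    assert(hinted_map.bucket_count() >= 50);

    std::cout << "default bucket_count: " << empty_map.bucket_count()
              << ", estimated bytes: " << estimated_bucket_array_bytes(empty_map) << '\n'
              << "hinted bucket_count:  " << hinted_map.bucket_count()
              << ", estimated bytes: " << estimated_bucket_array_bytes(hinted_map) << '\n';
}
```

Passing a bucket count to the constructor, as `hinted_map` does, is the standard way to influence the initial allocation; the container may round the request up.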

    What happens if, let's say, we allocated memory for 50 ints and ended up inserting 5000 integers? There will be a lot of collisions, so I believe there should be some kind of re-hashing and re-sizing algorithm to decrease the number of collisions once a certain collision threshold is reached.

    Rehashing/resizing isn't triggered by a certain number of collisions, but a certain proneness for collisions, as measured by the load factor, which is .size() / .bucket_count().

    When an insertion would push the .load_factor() above the .max_load_factor(), which you can change but is required by the C++ Standard to default to 1.0, then the hash table is resized. That effectively means it allocates more buckets - normally somewhere close to but not necessarily exactly twice as many - then it points the new buckets at the linked list nodes, then finally deletes the heap allocation with the old buckets.
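The growth behaviour can be watched directly. The sketch below (with a hypothetical helper name, `bucket_count_history`) inserts n keys and records `bucket_count()` each time it changes. The exact sequence of bucket counts is implementation-specific, so none is shown here. The assertion inside the loop reflects what mainstream implementations do with the default `max_load_factor()` of 1.0; the Standard words this as something the container attempts to maintain.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Inserts keys 0..n-1 and records bucket_count() whenever it changes.
inline std::vector<std::size_t> bucket_count_history(int n) {
    std::unordered_map<int, int> m;
    std::vector<std::size_t> history{m.bucket_count()};

    for (int i = 0; i < n; ++i) {
        m.emplace(i, i);
        // load_factor() is size() / bucket_count(); a rehash is triggered
        // when an insertion would push it past max_load_factor().
        assert(m.load_factor() <= m.max_load_factor());
        if (m.bucket_count() != history.back()) {
            history.push_back(m.bucket_count());  // the table was resized
        }
    }
    return history;
}
```

Calling `bucket_count_history(5000)` and printing the result shows how few resizes are needed to grow from the initial table to one holding 5000 elements. If the final size is known up front, `reserve(5000)` avoids the intermediate rehashes.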

    Since they are explicitly provided as member functions of the class, I assume they are used internally as well. Is there such a mechanism?

    There is no C++ Standard requirement about how the resizing is implemented. That said, if I were implementing rehash() I'd consider creating a function-local container whilst specifying the newly desired bucket_count, then iterate over the elements in the *this object, calling extract() to detach them, then merge() to add them to the function-local container object (both available since C++17), then eventually invoke swap on *this and the function-local container.
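The idea above can be sketched as a free function using only public C++17 members. `rebucket` is a hypothetical name, and this is not how any particular standard library implements `rehash()` internally. It uses `merge()` on the whole source container, which detaches and reinserts every node in one call instead of looping over `extract()` by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical sketch (C++17): move every node of `m` into a table created
// with the requested bucket count, then swap that table back into `m`.
template <class Map>
void rebucket(Map& m, std::size_t new_bucket_count) {
    // Function-local container with the desired number of buckets,
    // sharing the hash, equality predicate and allocator of `m`.
    Map tmp(new_bucket_count, m.hash_function(), m.key_eq(), m.get_allocator());
    tmp.max_load_factor(m.max_load_factor());

    // merge() splices the nodes across; elements are neither copied nor
    // moved, so pointers and references to them stay valid.
    tmp.merge(m);
    assert(m.empty());

    m.swap(tmp);  // `m` now owns the re-bucketed nodes
}
```

Because the nodes themselves are reused, only the bucket array is newly allocated, which matches the behaviour described earlier: new buckets are pointed at the existing linked-list nodes and the old bucket array is released when `tmp` goes out of scope.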
