Why does the 32769th insert fail in std::unordered_set?

Submitted by 泪湿孤枕 on 2021-01-27 15:06:34

Question


I generate a large number of class instances and store them in a std::unordered_set. I have defined a hash function and an equality relation, and so far everything works as it should: I insert 10000 instances with unordered_set::insert, and I can find them with unordered_set::find. All the objects are undamaged, and there is no hint of memory corruption or any other issue.

However, when I keep inserting, the 32769th insert fails. It doesn't throw, but it returns a pair whose iterator is == nullptr (0x00000000). insert is defined as:

pair<iterator, bool> insert(const value_type& Val);

and normally, the *iterator is the key I inserted, and the bool is true.
If I (after the error) try to find the object, it is in the set; if I try to insert it again, it tells me it's already there; so the insert itself seems to have worked fine. Only the returned value is pair<nullptr, true> instead of a valid pair<iterator, bool>.
Note that if I hand-fill the iterator and continue in the debugger, the same issue happens again at the first insert after 65536, and then at 131072, and so on (so for 2^15+1, 2^16+1, 2^17+1, ...), but not at 3*32768+1, etc.
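For context, the setup looks roughly like the following sketch (Key, KeyHash and the hash logic here are simplified stand-ins for the real classes, which are far more complex):

#include <cstddef>
#include <string>
#include <unordered_set>

// Simplified stand-in for the real class.
struct Key {
    int         id;
    std::string name;

    bool operator==(const Key& other) const {
        return id == other.id && name == other.name;
    }
};

// Simplified stand-in for the real hash function.
struct KeyHash {
    std::size_t operator()(const Key& k) const noexcept {
        return std::hash<int>{}(k.id) ^ (std::hash<std::string>{}(k.name) << 1);
    }
};

std::unordered_set<Key, KeyHash> keys;

// auto result = keys.insert(Key{1, "example"});
// Normally result.first points at the stored key and result.second is true.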

To me, this looks like some short overflow. Maybe my hashes are really bad and lead to uneven filling of the buckets, and at 32768 it runs out of buckets? I could not find anything more detailed about such a limit when googling, and I don't know enough about balanced trees or whatever this container uses internally.
Still, the standard library code should be able to handle bad hashing; I understand if it gets slow and inefficient, but it shouldn't fail.

Question: Why does the insert fail at 2^15+1, 2^16+1, and so on, and how can I avoid it?

This is in Microsoft Visual Studio 2017 V15.7.1 (the latest version as of 2018-05-15). The compiler is set to use C++17 rules, but I doubt that makes any difference.
I cannot paste complete code for a minimal reproducible example: the object generation is complex, spread across multiple classes and methods with several hundred lines of code, and the generated hashes obviously depend on the details of the objects and are not easily reproducible in dummy code.

### Update after one day ###: (I cannot put this in an answer, because the question was put on hold.) After extensive debugging of the standard library (including a lot of head-scratching), @JamesPoag's answer turns out to point to the right thing.
After n inserts, I get:

  n     load_factor  max_load_factor  bucket_count  max_bucket_count
32766   0.999938965  1.00000000       32768         536870911 (=2^29-1)
32767   0.999969482  1.00000000       32768         536870911
32768   1.000000000  1.00000000       32768         536870911
32769   0.500000000  1.00000000       65536         536870911

Not surprising: after 32768 inserts, the load factor has reached its maximum. The 32769th insert triggers a rehash to a bigger table, inside the internal method _Check_size:

void _Check_size()
    {   // grow table as needed
    if (max_load_factor() < load_factor())
        {   // rehash to bigger table
        size_type _Newsize = bucket_count();

        if (_Newsize < 512)
            _Newsize *= 8;  // multiply by 8
        else if (_Newsize < _Vec.max_size() / 2)
            _Newsize *= 2;  // multiply safely by 2
        _Init(_Newsize);
        _Reinsert();
        }
    }

At the end, _Reinsert() is called; it fills all 32769 keys into the new buckets and sets all the _next and _prev pointers accordingly. That works fine.
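For illustration, the growth sequence implied by this policy can be reproduced with a small sketch (assuming the table starts at MSVC's usual minimum of 8 buckets and the default max_load_factor() of 1.0):

#include <cstddef>
#include <iostream>

int main() {
    // Mirror the growth policy from _Check_size(): x8 below 512 buckets, x2 above.
    std::size_t buckets = 8;
    for (int i = 0; i < 11; ++i) {
        std::cout << buckets << ' ';
        buckets = (buckets < 512) ? buckets * 8 : buckets * 2;
    }
    std::cout << '\n';
    // Prints: 8 64 512 1024 2048 4096 8192 16384 32768 65536 131072
    // With max_load_factor() == 1.0, the table is full at exactly 32768 elements,
    // so the 32769th insert triggers the jump to 65536 buckets, the 65537th to 131072, etc.
}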
However, the code that calls _Check_size() (and thus _Reinsert()) looks like this (_Plist is my set's name; this code is generated from a template):

_Insert_bucket(_Plist, _Where, _Bucket);

_TRY_BEGIN
_Check_size();
_CATCH_ALL
erase(_Make_iter(_Plist));
_RERAISE;
_CATCH_END

return (_Pairib(_Make_iter(_Plist), true));
}

The critical point is the last line: _Plist is used to build the returned pair, but it holds a now-dead _next pointer, because all the buckets' addresses were rebuilt in _Check_size() a few lines earlier. I think this is an error in the std library: here it needs to find _Plist in the new set, where it looks the same but has a valid _next pointer.

An easy 'fix' (verified to work) is to expand the set right before the critical insert:
if (mySet.size() == mySet.bucket_count()) mySet.rehash(mySet.bucket_count() * 2);
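A related variation on the same idea: if the final number of elements can be estimated, reserving buckets up front avoids any rehash during the inserts (expectedCount here is a hypothetical estimate, not from my real code):

// Pre-size the table so that inserting up to expectedCount elements never pushes
// load_factor() above max_load_factor(), and therefore never triggers a rehash.
std::size_t expectedCount = 100000;   // hypothetical estimate of the final size
mySet.reserve(expectedCount);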

### Further Update: ### I have spent 16+ hours trying to produce minimal code that reproduces the issue, but I have not yet been able to. I'll try to log the actual calculated hashes for the existing large code.
One thing I found is that the hash value of one of the keys changed (unintentionally) between being inserted and being rehashed. This might be the root cause; if I move the rehashing outside of the insert, the issue is gone.
I am not sure if there is a rule that hashes have to be constant, but it probably makes sense; how else could you find the key again?


Answer 1:


I plugged some simple code into godbolt.org to see what the output was, but nothing jumped out at me.

I suspect that the value is inserted and the iterator is created, but the insertion exceeds the max_load_factor and triggers a rehash. On rehash, the previous iterators are invalidated. The returned iterator might be zeroed out in this case (or never set); again, I can't find it in the disassembly.

Check load_factor(), max_load_factor() and bucket_count() before and after the offending insert.
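For example, something like this around the suspicious insert (mySet and value are placeholders for your actual set and key):

#include <iostream>

// Dump the table statistics immediately before and after the insert to see
// whether it crosses max_load_factor() and triggers a rehash.
template <class Set, class Value>
void insert_with_stats(Set& mySet, const Value& value) {
    auto dump = [&](const char* tag) {
        std::cout << tag
                  << " size="            << mySet.size()
                  << " load_factor="     << mySet.load_factor()
                  << " max_load_factor=" << mySet.max_load_factor()
                  << " bucket_count="    << mySet.bucket_count() << '\n';
    };
    dump("before:");
    auto result = mySet.insert(value);
    dump("after: ");
    std::cout << "insert returned second=" << result.second << '\n';
}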




Answer 2:


[this is a self-answer]
The issue is not in the standard library as I assumed; it is in my code after all (little surprise). Here is what happened:

I am inserting complex objects into the unordered_set, and the hash is calculated from the object. Let's say object 1 has hash H1, object 2 has hash H2, and so on.
Further on, I temporarily modify an inserted object, clone it, insert the clone into the unordered_set, and undo the modification. However, if the insert triggers a reorganization of the set (which happens at 2^15, 2^16, etc.), the hashes of all existing objects are recalculated. As object 1 is currently 'temporarily modified', its hash does not come back as H1 but as a different value. That messes up the internal structure of the set, and it ends up returning an invalid iterator. Pseudocode:

myMap.insert(Object1);  // hash H1 is internally calculated
Object1.DoChange();     // temporary modification
Object2 = Clone(Object1);
myMap.insert(Object2);  // <-- problem: triggers an internal rehash and finds a different hash (not H1) for Object1!
Object1.UndoChange();   // too late, damage done
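A self-contained sketch of this failure pattern (the pointer-keyed set and the Object type here are illustrative, not my real code; the exact symptom is implementation-dependent):

#include <cstddef>
#include <iostream>
#include <unordered_set>

// Illustrative only: the hash reads mutable state through the stored pointer,
// so modifying the pointee changes the hash of an element already in the set.
struct Object { int value; };

struct ObjectPtrHash {
    std::size_t operator()(const Object* o) const noexcept {
        return std::hash<int>{}(o->value);
    }
};
struct ObjectPtrEqual {
    bool operator()(const Object* a, const Object* b) const noexcept {
        return a->value == b->value;
    }
};

int main() {
    std::unordered_set<Object*, ObjectPtrHash, ObjectPtrEqual> set;

    Object obj{42};
    set.insert(&obj);

    obj.value = 99;                        // 'temporary modification' while the element is in the set
    set.rehash(set.bucket_count() * 2);    // stands in for the rehash a later insert may trigger
    obj.value = 42;                        // undo - but the element was re-bucketed under the wrong hash

    // Typically prints "NOT found": the lookup hashes 42, but the element
    // sits in the bucket that was chosen for hash(99) during the rehash.
    std::cout << (set.count(&obj) ? "found" : "NOT found") << '\n';
}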

The problem disappears if I move the rehashing outside the insert, or if I undo the modification of the object before the critical insert (so the hash is correct again).
There are several other ways to avoid the issue (clone before modifying, store the hash value in the object and don't recalculate it, etc.).

Core lesson: Hash calculation must be stable. You cannot modify objects that are in a set or map in a way that changes their calculated hash; the set or map might trigger a rehash at an unexpected point in time.



Source: https://stackoverflow.com/questions/50402508/why-does-the-32769th-insert-fail-in-stdunordered-set
