Having a variable-length encoding is indirectly forbidden by the standard.
So I have several questions:
How is the following part of the standard handled?
Here's how Microsoft's STL implementation handles the variable-length encoding:
basic_string::operator[] can return a low or a high surrogate, in isolation.
basic_string::length() returns the number of wchar_t objects, not the number of Unicode characters. A surrogate pair (one Unicode character) uses two wchar_t's and therefore adds two to the length.
basic_string::resize() can truncate a string in the middle of a surrogate pair.
basic_string::insert() can insert in the middle of a surrogate pair.
basic_string::erase() can erase either half of a surrogate pair.
In general, the pattern should be clear: the STL does not assume that a std::wstring is in UTF-16, nor enforce that it remains UTF-16.
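
The behaviour listed above can be reproduced with a small sketch. It uses std::u16string rather than std::wstring so that the code unit is 16 bits on every platform; that substitution is an assumption made for portability, on the understanding that std::wstring behaves the same way on Windows, where wchar_t is 16 bits. The helper names (is_high_surrogate, is_low_surrogate, count_code_points, clef_sample, truncated_sample) are hypothetical and not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers: classify a single UTF-16 code unit.
constexpr bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Counts Unicode code points, treating a well-formed surrogate pair as one.
// A lone (unpaired) surrogate is counted as one code point as well.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++n;
    }
    return n;
}

// "a", U+1D11E MUSICAL SYMBOL G CLEF (encoded as the pair D834 DD1E), "b".
std::u16string clef_sample() { return u"a\U0001D11Eb"; }

// resize() works on code units, so it can cut between the two halves of a pair.
std::u16string truncated_sample() {
    std::u16string s = clef_sample();
    s.resize(2);  // keeps 'a' and the high surrogate only
    return s;
}
```

Here clef_sample().size() is 4 although the string holds three characters, clef_sample()[1] is a high surrogate on its own, and truncated_sample() ends in an unpaired high surrogate. Nothing in basic_string detects or prevents any of this; code that needs character counts has to do something like count_code_points itself.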