Question
Do C++20's strict aliasing rules ([basic.lval]/11) allow the following casts?
- cast between char* and char8_t*
string str = "string";
u8string u8str { (char8_t*) &*str.data() }; // c++20 u8string
u8string u8str2 = u8"zß水🍌";
string str2 { (char*) u8str2.data() };
- cast between uint32_t*, uint_least32_t* and char32_t*
vector<uint32_t> ui32vec = { 0x007a, 0x00df, 0x6c34, 0x0001f34c };
u32string u32str { (char32_t*) &*ui32vec.data(), ui32vec.size() };
u32string u32str2 = U"zß水🍌";
vector<uint32_t> ui32vec2 { (uint32_t*) &*u32str2.begin(),
(uint32_t*) &*u32str2.end() };
- cast between uint16_t*, uint_least16_t* and char16_t*
vector<uint16_t> ui16vec = { 0x007a, 0x00df, 0x6c34, 0xd83c, 0xdf4c };
u16string u16str { (char16_t*) &*ui16vec.data(), ui16vec.size() };
u16string u16str2 = u"zß水\ud83c\udf4c";
vector<uint16_t> ui16vec2 { (uint16_t*) &*u16str2.begin(),
(uint16_t*) &*u16str2.end() };
Update
basic_string constructor overload (6)
template< class InputIt >
basic_string( InputIt first, InputIt last,
const Allocator& alloc = Allocator() );
vector constructor overload (4)
template< class InputIt >
vector( InputIt first, InputIt last,
const Allocator& alloc = Allocator() );
I wonder whether it is okay to use the LegacyInputIterator constructors instead?
char* and char8_t* as LegacyInputIterator
string str = "string";
u8string u8str { str.begin(), str.end() };
u8string u8str { &*str.begin(), &*str.end() };
u8string u8str2 = u8"zß水🍌";
string str2 { u8str2.begin(), u8str2.end() };
string str2 { &*u8str2.begin(), &*u8str2.end() };
uint32_t*, uint_least32_t* and char32_t* as LegacyInputIterator
vector<uint32_t> ui32vec = { 0x007a, 0x00df, 0x6c34, 0x0001f34c };
u32string u32str { ui32vec.begin(), ui32vec.end() };
u32string u32str { &*ui32vec.begin(), &*ui32vec.end() };
u32string u32str2 = U"zß水🍌";
vector<uint32_t> ui32vec2 { u32str2.begin(),
u32str2.end() };
vector<uint32_t> ui32vec2 { &*u32str2.begin(),
&*u32str2.end() };
uint16_t*, uint_least16_t* and char16_t* as LegacyInputIterator
vector<uint16_t> ui16vec = { 0x007a, 0x00df, 0x6c34, 0xd83c, 0xdf4c };
u16string u16str { ui16vec.begin(), ui16vec.end() };
u16string u16str { &*ui16vec.begin(), &*ui16vec.end() };
u16string u16str2 = u"zß水\ud83c\udf4c";
vector<uint16_t> ui16vec2 { u16str2.begin(),
u16str2.end() };
vector<uint16_t> ui16vec2 { &*u16str2.begin(),
&*u16str2.end() };
Answer 1:
The char*_t family of types does not have any special aliasing rules. Therefore, the standard rules apply, and those rules make no exception for conversions between a type and its underlying type.
So most of what you did is UB. The one case that isn't UB is char, due to its special nature. You can in fact read the bytes of a char8_t as an array of char. But you can't do the opposite, reading the bytes of a char array as char8_t.
Now, the values of these types are freely convertible to each other. So you can convert the values in those arrays to the other type any time you want.
All that being said, on real implementations those casts will almost certainly work. Well, until they don't: you modify an object through a pointer of a type that is not allowed to alias it, and the compiler doesn't reload the changed value because it assumed the object couldn't have been changed. So really, just use the correct, meaningful type.
Answer 2:
Just so we are on the same page, C-style casts of the form (T*) expression are equivalent here to reinterpret_cast<T*>(expression) ([expr.cast]/4.4), which is equivalent to static_cast<T*>(static_cast<void*>(expression)) ([expr.reinterpret.cast]/7). This does nothing to the value of the pointer, as the objects are not pointer-interconvertible. (See [expr.static.cast]/13 and [basic.compound]/4).
So yes, we would have to look at [basic.lval]/11 to see if it can be aliased. The reference must have a type which is similar to:
- the dynamic type of the object,
- a type that is the signed or unsigned type corresponding to the dynamic type of the object, or
- a char, unsigned char, or std::byte type.
Which is not the case here. Even though char8_t has unsigned char as its underlying type, the two are not similar types.
So, for example:
unsigned char uc = 'a';
// Represents address of uc
unsigned char* uc_ptr = &uc;
// Still holds the address of uc, not a char8_t
char8_t* c8_ptr = reinterpret_cast<char8_t*>(uc_ptr);
char8_t c8 = *c8_ptr; // UB, as `char8_t` is not `cv unsigned char`.
Though because of [basic.fundamentals]/6, which says:
A fundamental type specified to have a signed or unsigned integer type as its underlying type has the same object representation [...]
You can do reinterpret_cast<unsigned char*>(pointer-to-char8_t) and have all the values be equal, but that is the only case (and also char* iff char is unsigned; otherwise the values may compare unequal, even for values < 128). For all other types, you can use this rule to memcpy:
// Assuming std::is_same_v<uint32_t, uint_least32_t>
vector<uint32_t> ui32vec = { 0x007a, 0x00df, 0x6c34, 0x0001f34c };
u32string u32str(ui32vec.size(), U'\x00');
std::memcpy(u32str.data(), ui32vec.data(), ui32vec.size() * sizeof(uint32_t));
u32string u32str2 = U"zß水🍌";
vector<uint32_t> ui32vec2(u32str2.size(), 0);
// Destination first, then source: copy the string's code units into the vector
std::memcpy(ui32vec2.data(), u32str2.data(), u32str2.size() * sizeof(uint32_t));
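The memcpy approach above can be wrapped into a pair of small functions. This is only a sketch under stated assumptions: the function names are hypothetical, uint32_t is assumed to exist, and the static_assert states the size assumption the copy depends on.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Assumption made explicit: both element types occupy the same number of bytes.
static_assert(sizeof(char32_t) == sizeof(std::uint32_t),
              "memcpy between these element types requires equal sizes");

// Hypothetical helper: copies the object representation of each uint32_t
// into a freshly allocated u32string of the same length.
std::u32string copy_to_u32string(const std::vector<std::uint32_t>& src) {
    std::u32string dst(src.size(), U'\0');
    if (!src.empty()) {
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(std::uint32_t));
    }
    assert(dst.size() == src.size());
    return dst;
}

// Hypothetical helper: the reverse copy, from char32_t storage to uint32_t.
std::vector<std::uint32_t> copy_to_vector(const std::u32string& src) {
    std::vector<std::uint32_t> dst(src.size());
    if (!src.empty()) {
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(char32_t));
    }
    return dst;
}
```

Because memcpy copies bytes rather than accessing the objects through an lvalue of the other type, the aliasing rule in [basic.lval]/11 is not involved.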
Answer 3:
A C-style cast is not the same thing as reinterpret_cast.
The standard sections I think are relevant to your question:
6.7.1.9: Type char8_t denotes a distinct type whose underlying type is unsigned char. Types char16_t and char32_t denote distinct types whose underlying types are uint_least16_t and uint_least32_t, respectively, in <cstdint>.
7.2.1.11: If a program attempts to access the stored value of an object through a glvalue whose type is not similar ([conv.qual]) to one of the following types the behavior is undefined:
1. the dynamic type of the object,
2. a type that is the signed or unsigned type corresponding to the dynamic type of the object, or
3. a char, unsigned char, or std::byte type.
1. char8_t* --> char*: Yes.
   Because char is one of the types through which any object may be accessed. But the standard does not guarantee that the (dereferenced) converted values are equal for distinct types: char can be signed or not, and char8_t is unsigned. char8_t* --> unsigned char* is valid but should not guarantee that either, because the types are still distinct. But given that it's char8_t's underlying type it should be, I guess?
2. char* --> char8_t*: No.
   As per 6.7.1.9 those types are distinct. An argument might be made that the "whose underlying type is unsigned char" part could apply, with unsigned char being explicitly allowed in 7.2.1.11.3, but I don't think that would be the correct interpretation; being distinct should be the deciding factor. That is supported by the following quote from the proposal P0482R6 - char8_t: A type for UTF-8 characters and strings (Revision 6 - 2018-11-09) (I did not find a more recent revision):
   "Finally, processing of UTF-8 strings is currently subject to an optimization pessimization due to glvalue expressions of type char potentially aliasing objects of other types. Use of a distinct type that does not share this aliasing behavior may allow for further compiler optimizations."
3. uint32_t* <--> char32_t*, uint16_t* <--> char16_t*, uint16_t* <--> uint_least16_t*, uint32_t* <--> uint_least32_t*, uint_least32_t <--> char32_t, uint_least16_t <--> char16_t: No.
   Those pairs are all distinct, so 7.2.1.11.1 does not apply, and neither type is in 7.2.1.11.3, so not even the second part of 2. can be relevant.
4. unsigned char* --> char8_t*: No.
   By the same argument as in 2. It's not a T* -> T* cast, which is obviously allowed.
5. char8_t* --> unsigned char*: Yes.
   Because unsigned char is also one of the allowed types per 7.2.1.11.3. But I would still argue that the standard does not guarantee that the (dereferenced) converted values will be equal. But given that it's char8_t's underlying type, it doesn't have any option other than to be equal, I guess?
Source: https://stackoverflow.com/questions/56415276/do-the-strict-aliasing-rules-in-c20-allow-reinterpret-cast-between-the-stand