Question
Introduction: This question is part of my collection of C and C++ (and C/C++ common subset) questions regarding the cases where pointer objects with strictly identical byte-wise representations are allowed to have different "values", that is, to behave differently for some operation (including having defined behavior on one object and undefined behavior on the other).
Following another question which caused much confusion, here is a question about pointer semantics that will hopefully clear things up:
Is this program valid in all cases? The only interesting part is in the "pa1 == pb" branch.
#include <stdio.h>
#include <string.h>

int main() {
    int a[1] = { 0 }, *pa1 = &a[0] + 1, b = 1, *pb = &b;
    if (memcmp (&pa1, &pb, sizeof pa1) == 0) {
        int *p;
        printf ("pa1 == pb\n"); // interesting part
        memcpy (&p, &pa1, sizeof p); // make a copy of the representation
        memcpy (&pa1, &p, sizeof p); // pa1 is a copy of the bytes of pa1 now
                                     // and the bytes of pa1 happens to be the bytes of pb
        *pa1 = 2; // does pa1 legally point to b?
    }
    else {
        printf ("pa1 != pb\n"); // failed experiment, nothing to see
        pa1 = &a[0]; // ensure well defined behavior in printf
    }
    printf ("b = %d *pa1 = %d\n", b, *pa1);
    return 0;
}
I would like an answer based on standard quotes.
EDIT
By popular demand, here is what I want to know:
- is a pointer's semantic "value" (its behavior according to the specification) determined only by its numerical value (the numerical address it contains), for a pointer of a given type?
- if not, is it possible to copy only the physical address contained in a pointer while leaving out the associated semantics?
Here let's say that some one past the end pointer happens to accidentally point to another object; how can I use such one past the end pointer to access the other object?
I have the right to do anything, except use a copy of the address of the other object. (It's a game to understand pointers in C.)
IOW, I try to recycle dirty money just like the mafia. But I recycle a dirty pointer by extracting its value representation. Then it looks like the clean money, I mean pointer. Nobody can tell the difference, no?
Answer 1:
The question was:
Is this program valid in all cases?
The answer is "no, it is not".
The only interesting part of the program is what happens within the block guarded by the if statement. It is somewhat difficult to guarantee the truth of the controlling expression, so I've modified the program by moving the variables to file scope. The same question remains: is this program always valid?
#include <stdio.h>
#include <string.h>

static int a[1] = { 2 };
static int b = 1;
static int *pa1 = &a[0] + 1;
static int *pb = &b;

int main(void) {
    if (memcmp (&pa1, &pb, sizeof pa1) == 0) {
        int *p;
        printf ("pa1 == pb\n"); // interesting part
        memcpy (&p, &pa1, sizeof p); // make a copy of the representation
        memcpy (&pa1, &p, sizeof p); // pa1 is a copy of the bytes of pa1 now
                                     // and the bytes of pa1 happens to be the bytes of pb
        *pa1 = 2; // does pa1 legally point to b?
    }
}
Now the guarding expression is true on my compiler (of course, by having these have static storage duration, a compiler cannot really prove that they're not modified by something else in the interim...)
The pointer pa1 points to just past the end of the array a, and is a valid pointer, but must not be dereferenced, i.e. *pa1 has undefined behaviour given that value. The case is now made that copying this value to p and back again would make the pointer valid.
The answer is no, this is still not valid, but it is not spelt out very explicitly in the standard itself. The committee response to C standard defect report DR 260 says this:
If two objects have identical bit-pattern representations and their types are the same they may still compare as unequal (for example if one object has an indeterminate value) and if one is an indeterminate value attempting to read such an object invokes undefined behavior. Implementations are permitted to track the origins of a bit-pattern and treat those representing an indeterminate value as distinct from those representing a determined value. They may also treat pointers based on different origins as distinct even though they are bitwise identical.
I.e. you cannot even conclude that if pa1 and pb are pointers of the same type and memcmp (&pa1, &pb, sizeof pa1) == 0 is true, then pa1 == pb necessarily holds, let alone that copying the bit pattern of the undereferenceable pointer pa1 to another object and back again would make pa1 valid.
The response continues:
Note that using assignment or bitwise copying via memcpy or memmove of a determinate value makes the destination acquire the same determinate value.
i.e. it confirms that memcpy (&p, &pa1, sizeof p); will cause p to acquire the same value as pa1, which it didn't have before.
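For contrast with the question's program, here is a minimal sketch of the uncontroversial case the committee response describes: copying the bytes of a pointer that legitimately points to an object. The helper name copy_pointer_bytes is hypothetical and not part of the original answer.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns a copy of `src` made by copying its bytes
   instead of using assignment. `src` is assumed to hold a determinate value. */
int *copy_pointer_bytes(int *src) {
    int *dst;
    memcpy(&dst, &src, sizeof dst);  /* dst acquires the same value as src */
    assert(dst == src);              /* same value, so the two compare equal */
    return dst;
}
```

If src is &x for some live int x, the returned pointer can be used to read and write x exactly as src could. Nothing here makes a one-past-the-end pointer dereferenceable: the copy has the same value, and therefore the same restrictions, as the original.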
This is not just a theoretical problem - compilers are known to track pointer provenance. For example the GCC manual states that
When casting from pointer to integer and back again, the resulting pointer must reference the same object as the original pointer, otherwise the behavior is undefined. That is, one may not use integer arithmetic to avoid the undefined behavior of pointer arithmetic as proscribed in C99 and C11 6.5.6/8.
i.e. were the program written as:
// uintptr_t requires <stdint.h>
int a[1] = { 0 }, *pa1 = &a[0] + 1, b = 1, *pb = &b;
if (memcmp (&pa1, &pb, sizeof pa1) == 0) {
    uintptr_t tmp = (uintptr_t)&a[0]; // pointer to a[0]
    tmp += sizeof (a[0]); // value of address to a[1]
    pa1 = (int *)tmp;
    *pa1 = 2; // pa1 still would have the bit pattern of pb,
              // hold a valid pointer just past the end of array a,
              // but not legally point to b
}
the GCC manual points out that this is explicitly not legal.
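By contrast, a round trip that leaves the integer untouched yields a pointer to the same object, which is the case the quoted GCC rule permits. The sketch below assumes the implementation provides the optional type uintptr_t; the helper name round_trip is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: converts a pointer to an integer and back without
   doing any arithmetic on the integer in between. */
int *round_trip(int *p) {
    uintptr_t bits = (uintptr_t)(void *)p;  /* pointer -> integer */
    int *back = (int *)(void *)bits;        /* integer -> pointer to the same object */
    assert(back == p);
    return back;
}
```

Calling round_trip(&b) and writing through the result modifies b, because the resulting pointer references the same object as the original.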
Answer 2:
A pointer is simply an unsigned integer whose value is the address of some location in memory. Overwriting the contents of a pointer variable is no different than overwriting the contents of a normal int variable.
So yes, doing e.g. memcpy (&p, &pa1, sizeof p) is equivalent to the assignment p = pa1, but might be less efficient.
Let's try it a bit differently instead:
You have pa1, which points to some object (or rather, one beyond some object), and then you have the pointer &pa1, which points to the variable pa1 (i.e. where the variable pa1 is located in memory).
Graphically it would look something like this:
+------+     +-----+     +-------+
| &pa1 | --> | pa1 | --> | &a[1] |
+------+     +-----+     +-------+
[Note: &a[0] + 1 is the same as &a[1]]
Answer 3:
Undefined behaviour: A play in n parts.
Compiler1 and Compiler2 enter, stage right.
int a[1] = { 0 }, *pa1 = &a[0] + 1, b = 1, *pb = &b;
[Compiler1] Hello, a, pa1, b, pb. How very nice to make your acquaintance. Now you just sit right there, we're going to look through the rest of the code to see if we can allocate you some nice stack space.
Compiler1 looks through the rest of the code, frowning occasionally and making some markings on the paper. Compiler2 picks his nose and stares out the window.
[Compiler1] Well, I'm afraid, b, that I have decided to optimize you out. I simply couldn't detect anything that modified your memory. Maybe your programmer did some tricks with Undefined Behaviour to work around this, but I'm allowed to assume that there is no such UB present. I'm sorry.
Exit b, pursued by a bear.
[Compiler2] Wait! Hold on a second there, b. I couldn't be bothered optimizing this code, so I've decided to give you a nice cosy space over there on the stack.
b jumps in glee, but is murdered by nasal demons as soon as he is modified through undefined behaviour.
[Narrator] Thus ends the sad, sad tale of variable b. The moral of this story is that one can never rely on undefined behaviour.
Answer 4:
You have proven that it seems to work on a specific implementation. That doesn't mean that it works in general. In fact, it is undefined behavior where one possible outcome is exactly "seems to work".
If we go back to the MS-DOS era, we had near pointers (relative to a specific segment) and far pointers (containing both a segment and an offset).
Large arrays were often allocated in their own segment and only the offset was used as a pointer. The compiler already knew what segment contained a specific array, so it could combine the pointer with the proper segment register.
In that case, you could have two pointers with the same bit-pattern, where one pointer pointed into an array segment (pa) and another pointer pointed into the stack segment (pb). The pointers compared equal, but still pointed to different things.
To make it worse, far pointers with a segment:offset pair could be formed with overlapping segments so that different bit-patterns still pointed to the same physical memory address. For example, 0100:0210 is the same address as 0120:0010.
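The overlap can be checked with the real-mode 8086 address calculation, in which the 16-bit segment is shifted left by four bits and added to the 16-bit offset. This is a small sketch; the function names linear_address and check_overlap_example are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode address translation: physical = segment * 16 + offset. */
uint32_t linear_address(uint16_t segment, uint16_t offset) {
    return ((uint32_t)segment << 4) + offset;
}

/* Checks the example from the text: two different segment:offset pairs
   that name the same physical address. */
void check_overlap_example(void) {
    assert(linear_address(0x0100, 0x0210) == 0x1210);
    assert(linear_address(0x0120, 0x0010) == 0x1210);
}
```

Both pairs resolve to physical address 0x1210, so a byte-wise comparison of the two far pointers reports a difference even though they designate the same memory.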
The C and C++ languages are designed so that this can work. That's why we have rules that comparing pointers only works (gives a total order) within the same array, and that pointers might not point to the same thing, even if they contain the same bit-pattern.
Answer 5:
Prior to C99, implementations were expected to behave as though the value of every variable of any type was stored as a sequence of unsigned char values; if the underlying representations of two variables of the same type were examined and found to be equal, that would imply that, unless Undefined Behavior had already occurred, their values would generally be equal and interchangeable. There was a little bit of ambiguity in a couple of places, e.g. given
char *p, *q;
p = malloc(1);
free(p);
q = malloc(1);
if (!memcmp(&p, &q, sizeof p))
    p[0] = 1;
every version of C has made abundantly clear that q may or may not be equal to p, and if q isn't equal to p, code should expect that anything might happen when p[0] is written. While the C89 Standard does not explicitly say that an implementation may only have p compare bitwise equal to q if a write to p would be equivalent to a write to q, such behavior would generally be implied by the model of variables being fully encapsulated in sequences of unsigned char values.
C99 added a number of situations where variables may compare bitwise equal but not be equivalent. Consider, for example:
extern int doSomething(char *p1, char *p2);
int act1(char * restrict p1, char * restrict p2)
    { return doSomething(p1,p2); }
int act2(char * restrict p)
    { return doSomething(p,p); }
char x[4]; /* char rather than int, so that x converts to the char * parameters */
int act3a(void) { return act1(x,x); }
int act3b(void) { return act2(x); }
int act3c(void) { return doSomething(x,x); }
Calling act3a, act3b, or act3c will cause doSomething() to be invoked with two pointers that compare equal to x, but if invoked through act3a, any element of x which is written within doSomething must be accessed exclusively using x, exclusively using p1, or exclusively using p2. If invoked through act3b, the function would gain the freedom to write elements using p1 and access them via p2 or vice versa. If invoked through act3c, the function could use p1, p2, and x interchangeably. Nothing in the binary representations of p1 or p2 would indicate whether they could be used interchangeably with x, but a compiler would be allowed to in-line expand doSomething within act1 and act2 and have the behavior of those expansions vary according to what pointer accesses were allowed and forbidden.
Answer 6:
*pa1 = 2; // does pa1 legally point to b?
No, that pa1 points to b is purely coincidental. Note that a program must conform at compile time; that the pointer happens to have the same value at run time doesn't matter.
Nobody can tell the difference, no?
The compiler optimizer can tell the difference!
The compiler optimizer can see (through static analysis of the code) that b is never accessed through a "legal" pointer, so it assumes it is safe to keep b in a register. This decision is made at compilation.
Bottom line:
"Legal" pointers are pointers obtained from a legal pointer by assignment or by copying the memory. You can also obtain a "legal" pointer using pointer arithmetic, provided the resulting pointer is within the legal range of the array/memory block it was assigned/copied from. If the result of pointer arithmetic happens to point to a valid address in another memory block, the use of such a pointer is still UB.
Also note that pointer comparison is valid only if the two pointers are pointing to same array/memory block.
EDIT:
Where did it go wrong?
The standard states that accessing an array out-of-bounds results in undefined behaviour. You took the address of an out-of-bounds by one pointer, copied it and then dereferenced it.
The standard states that an out-of-bounds pointer may compare equal to a pointer to another object that happens to be placed adjacent in memory (6.5.9 pt 6). However, even though they compare equal, semantically they don't point to the same object.
In your case, you don't compare the pointers, you compare their bit patterns. Doesn't matter. The pointer pa1 is still considered to be a pointer to one past the end of an array.
Note that if you replace memcpy with some function you write yourself, the compiler won't know what value pa1 has, but it can still statically determine that it cannot contain a "legally" obtained copy of &b.
Thus, the compiler optimizer is allowed to optimize the read/store of b in this case.
is a pointer's semantic "value" (its behavior according to the specification) determined only by its numerical value (the numerical address it contains), for a pointer of a given type?
No. The standard implies that valid pointers can only be obtained from objects using the address-of operator (&), by copying another valid pointer, or by incrementing or decrementing a pointer within the bounds of an array. As a special case, pointers one past the end of an array are valid but they must not be dereferenced. This might seem a bit strict, but without it the possibility to optimize would be limited.
if not, is it possible to copy only the physical address contained in a pointer while leaving out the associated semantics?
No, at least not in a way that is portable to any platform. In many implementations the pointer value is just the address. The semantics is in the generated code.
Answer 7:
No. We cannot even infer that either branch of this code works given any particular result of memcmp(). The object representations that you compare with memcmp() might be different even if the pointers would be equivalent, and the pointers might be different even if the object representations match. (I’ve changed my mind about this since I originally posted.)
You try to compare an address one-past-the-end of an array with the address of an object outside the array. The Standard (§6.5.8.5 of draft n1548, emphasis added) has this to say:
When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. If two pointers to object types both point to the same object, or both point one past the last element of the same array object, they compare equal. If the objects pointed to are members of the same aggregate object, pointers to structure members declared later compare greater than pointers to members declared earlier in the structure, and pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values. All pointers to members of the same union object compare equal. If the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression Q+1 compares greater than P. In all other cases, the behavior is undefined.
It repeats this warning that the result of comparing the pointers is undefined, in appendix J.
Also undefined behavior:
An object which has been modified is accessed through a restrict qualified pointer to a const-qualified type, or through a restrict-qualified pointer and another pointer that are not both based on the same object
However, none of the pointers in your program are restrict-qualified. Neither do you do illegal pointer arithmetic.
You try to get around this undefined behavior by using memcmp() instead. The relevant part of the specification (§7.23.4.1) says:
The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2.
So, memcmp() compares the bits of the object representations. Already, the bits of pa1 and pb will be the same on some implementations, but not others.
§6.2.6.1 of the Standard makes the following guarantee:
Two values (other than NaNs) with the same object representation compare equal, but values that compare equal may have different object representations.
What does it mean for pointer values to compare equal? §6.5.9.6 tells us:
Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.
That last clause, I think, is the clincher. Not only can two pointers that compare equal have different object representations, but two pointers with the same object representation might not be equivalent if one of them is a one-past-the-end pointer like &a[0]+1 and another is a pointer to an object outside the array, like &b. Which is exactly the case here.
Answer 8:
I say no, without resorting to the UB tarpit. From the following code:
extern int f(int x[3], int y[4]);
....
int a[7];
return f(a, a) + f(a+4, a+3);
...
The C standard should not prevent me from writing a compiler which performs bounds checking; there are several available. A bounds checking compiler would have to fatten the pointers by augmenting them with bounds information (*). So when we get to f():
....
if (x == y) {
....
f() would be interested in the C notion of equality, that is, whether the pointers point at the same location, not whether they have identical types. If you aren’t happy with this, suppose f() called g(int *s, int *t), and it contained a similar test. The compiler would perform the comparison without comparing the fat.
The pointer size sizeof(int *), would have to include the fat, so memcmp of two pointers would compare it as well, thus providing a different result from the compare.
(*) Yes, you could store such info in a dynamic associative array, which could result in the program aborting because of resource shortfalls, and may introduce tracking problems with memcpy, alloc & free.
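A rough sketch of what such a fattened pointer could look like, written as an ordinary struct so it can be compiled today. The type fat_ptr and the helper names are hypothetical, and the layout is only an assumption about how a bounds-checking compiler might do it. For the call f(a, a), the two parameters carry the same address but different bounds (3 and 4 elements).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical fat pointer: an address plus the end of the region it may access. */
struct fat_ptr {
    int *addr;
    int *limit;  /* one past the last element reachable through addr */
};

/* Assumption: two members of the same pointer type leave no padding bytes. */
_Static_assert(sizeof(struct fat_ptr) == 2 * sizeof(int *), "unexpected padding");

struct fat_ptr make_fat(int *addr, size_t count) {
    assert(addr != NULL);
    struct fat_ptr f = { addr, addr + count };
    return f;
}

/* The C notion of equality: do both point at the same location? */
bool fat_same_location(struct fat_ptr a, struct fat_ptr b) {
    return a.addr == b.addr;
}

/* What memcmp sees: the whole representation, bounds included. */
bool fat_same_bytes(struct fat_ptr a, struct fat_ptr b) {
    return memcmp(&a, &b, sizeof a) == 0;
}
```

With int a[7], make_fat(a, 3) and make_fat(a, 4) point at the same location, yet their byte-wise comparison differs because the stored limits differ.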
PS: should we introduce a new tag for navel gazing?
Answer 9:
The question, as I understand it, is:
Is memcpy of a pointer the same as assignment?
And my answer would be, yes.
memcpy is basically an optimized assignment for variable length data that has no memory alignment requirements. It's pretty much the same as:
void slow_memcpy(void * target, void * src, int len) {
    char * t = target;
    char * s = src;
    for (int i = 0; i < len; ++i)
    {
        t[i] = s[i];
    }
}
is a pointer's semantic "value" (its behavior according to the specification) determined only by its numerical value (the numerical address it contains), for a pointer of a given type?
Yes. There are no hidden data fields in C, so the pointer's behavior is totally dependent on its numerical data content.
However, pointer arithmetic is resolved by the compiler and depends on the pointer's type.
A char * str pointer's arithmetic uses char units (i.e., str[1] is one char away from str[0]), while an int * p_num pointer's arithmetic uses int units (i.e., p_num[1] is one int away from p_num[0]).
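The difference in units can be observed by measuring the distance between neighbouring elements in bytes. A small sketch follows; the function names are hypothetical, while str and p_num follow the names used in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Distance in bytes between p_num[1] and p_num[0]. */
ptrdiff_t int_stride_bytes(void) {
    int nums[2] = { 0, 0 };
    int *p_num = nums;
    return (char *)&p_num[1] - (char *)&p_num[0];
}

/* Distance in bytes between str[1] and str[0]. */
ptrdiff_t char_stride_bytes(void) {
    char text[2] = { 'a', 'b' };
    char *str = text;
    return &str[1] - &str[0];
}

/* The int stride equals sizeof(int); the char stride is always 1. */
void check_strides(void) {
    assert(int_stride_bytes() == (ptrdiff_t)sizeof(int));
    assert(char_stride_bytes() == 1);
}
```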
Are two pointers with identical bit patterns allowed to have different behavior? (edit)
Yes and no.
They point to the same location in the memory and in this sense they are identical.
However, pointer resolution might depend on the pointer's type.
For example, by dereferencing a uint8_t *, only 8 bits are read from the memory (usually). However, when dereferencing a uint64_t *, 64 bits are read from the memory address.
Another difference is pointer arithmetic, as described above.
However, when using functions such as memcpy or memcmp, the pointers will behave the same.
So why does everybody say "No"?
Well, that's because the code in your example doesn't reflect the question in the title. The code’s behavior is undefined, as clearly explained by the many answers.
(edit):
The issues with the code have little to do with the actual question.
Consider, for example, the following line:
int a[1] = { 0 }, *pa1 = &a[0] + 1, b = 1, *pb = &b;
In this case, pa1 points to a[1], which is out of bounds.
This pretty much throws the code into undefined behavior territory, which distracted many answers away from the actual question.
Source: https://stackoverflow.com/questions/32048698/is-memcpy-of-a-pointer-the-same-as-assignment