When endianness does matter - cast operations [duplicate]

Question

Possible Duplicate:
When does Endianness become a factor?

While reading this tutorial on endianness, I came across this example where endianness does matter. It is about filling a char array with 1 and 0. The array can then be converted to a short, and the result depends on whether the system is little or big endian. Here is the example, quoted.

unsigned char endian[2] = {1, 0};
short x;
x = *(short *) endian;
What would be the value of x? Let's look at what this code is doing. You're creating an array of two bytes, and then casting that array of two bytes into a single short. By using an array, you basically forced a certain byte order, and you're going to see how the system treats those two bytes. If this is a little-endian system, the 0 and 1 is interpreted backwards and seen as if it is 0,1. Since the high byte is 0, it doesn't matter and the low byte is 1, so x is equal to 1. On the other hand, if it's a big-endian system, the high byte is 1 and the value of x is 256.

I wonder: when you create an array with a given number of bytes (here, two), how can it be converted to any type (short, int...) as long as the array has been allocated the number of bytes that type needs? If not enough memory has been allocated to hold the type, will the next memory addresses still be read? For instance, if I cast endian to a long, will this be performed, reading four bytes from the beginning of endian, or will it fail?

Then, a question on endianness: it is a characteristic of the processor that determines how bytes are written to memory, with the most significant byte at the lowest memory location (big endian) or at the highest memory location (little endian). In this case, an array with two one-byte elements has been allocated. Why is 1 said to be the most significant byte?

Answer 1:

Don't forget that the compiler only writes assembly code. Even if you ignore all the warnings that the compiler emits, you can examine the assembly code it produced and figure out what really happens.

I took this simple program:

#include <iostream>

int main()
{
    unsigned endian[2] = { 0, 0 } ;
    long * casted_endian = reinterpret_cast<long*>( endian );
    std::cout << *casted_endian << std::endl;
}

and I extracted this code using objdump. Let’s decipher it.

 804879c:   55                      push   %ebp
 804879d:   89 e5                   mov    %esp,%ebp
 804879f:   83 e4 f0                and    $0xfffffff0,%esp
 80487a2:   83 ec 20                sub    $0x20,%esp

These lines are just the function prologue; ignore them.

    unsigned endian[2] = { 0, 0 } ;
 80487a5:   c7 44 24 14 00 00 00    movl   $0x0,0x14(%esp)
 80487ac:   00 
 80487ad:   c7 44 24 18 00 00 00    movl   $0x0,0x18(%esp)
 80487b4:   00

From those two instructions, you can see that 0x14(%esp) is initialized with 0. So the array endian is on the stack, at the address held in the %ESP register (the stack pointer) plus 0x14.

    long * casted_endian = reinterpret_cast<long*>( endian );
 80487b5:   8d 44 24 14             lea    0x14(%esp),%eax

LEA is just an arithmetic operation. EAX now contains %ESP+0x14, which is the address of the array on the stack.

 80487b9:   89 44 24 1c             mov    %eax,0x1c(%esp)

At the address ESP + 0x1c (which is the location of the variable casted_endian) we store EAX, that is, the address of the first byte of endian.

    std::cout << *casted_endian << std::endl;
 80487bd:   8b 44 24 1c             mov    0x1c(%esp),%eax
 80487c1:   8b 00                   mov    (%eax),%eax
 80487c3:   89 44 24 04             mov    %eax,0x4(%esp)
 80487c7:   c7 04 24 40 a0 04 08    movl   $0x804a040,(%esp)
 80487ce:   e8 1d fe ff ff          call   80485f0 <std::ostream::operator<<(long)@plt>

Then the value is loaded through that pointer (mov (%eax),%eax) and passed to operator<< as its argument, without any check. So that's it: the program performs no checks. The type of the variable is completely irrelevant to the machine.

Now two things can happen when the program reads the part of *casted_endian that is not in the array.

Either the address is in a memory page that is currently mapped, or it is not. In the first case, the load reads whatever is at that address without complaining, and the program will probably print something weird. In the second case, the OS objects to the program reading memory it is not allowed to read and raises a fault. This is the famous segmentation fault.

Answer 2:

If you try to cast to a size larger than the array, you'll get undefined behavior. It will probably try to read the contents of the memory that comes right after the array, but that result is not guaranteed and need not be consistent either.
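One way to avoid reading past the array is to check the size before copying the bytes out. The following is only a sketch, assuming C++17 (for std::optional); read_as is a hypothetical helper name that does not appear in the original posts.

```cpp
#include <cstddef>
#include <cstring>
#include <optional>
#include <type_traits>

// Hypothetical helper (not from the original posts): copies sizeof(T) bytes
// out of a buffer, but only when the buffer is large enough to hold a T.
template <typename T>
std::optional<T> read_as(const unsigned char* data, std::size_t size)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "read_as only makes sense for trivially copyable types");
    if (size < sizeof(T)) {
        return std::nullopt;  // refuse instead of reading past the end
    }
    T value;
    std::memcpy(&value, data, sizeof(T));  // copies bytes; no pointer cast
    return value;
}
```

With the two-byte array from the question, read_as<long>(endian, sizeof endian) returns an empty optional instead of touching the memory that follows the array.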

Answer 3:

Oh lord. What I'm going to say here is why this works on most architectures, but I can't say how much of this is actually standard.

What you are doing there is casting the array endian to a short pointer. Arrays are not pointers, but in most expressions the name of an array converts (decays) to the address of its first element; sizeof is one of the exceptions, which is why it still reports the size of the whole array. You are then taking that address (endian) and creating a short pointer from it. The memory address stays the same, it's just that you're interpreting the data pointed to differently. You're then dereferencing this pointer to get the value back out, and assigning it to x.

A quick side note. This might not work on all systems. In C, int is only guaranteed to be at least 16 bits wide; it is typically 4 bytes on both x86 and x86_64. short is guaranteed to be at least 16 bits and no wider than an int. For this reason, that code will fail on any platform where short is wider than two bytes. For this to work, the size of the target data type in bytes must be equal to or less than the size of the array.

Equally, long is only guaranteed to be at least 32 bits and at least as wide as an int; it is typically 4 bytes on 32-bit x86 and 8 bytes on x86_64 Linux (64-bit Windows keeps it at 4 bytes). In that case, an 8-byte array is large enough for this code on x86 and x86_64:

unsigned char endian[8] = {1,2,3,4,5,6,7,8};
long x = *(long*)endian;

Anyway, the endianness of the processor depends completely on the processor. x86 is little endian (and basically started the convention of LE devices, IIRC). SPARC is big endian (until 9, which can be both). ARM and MIPS are also configurable, and Microblaze depends on the bus used (AXI or PLB). In any case, endianness is not restricted just to processors, it is also an issue when communicating with hardware or other computers.

For your final question, the most significant byte is called that because a change in it changes the value more than any change in the less significant bytes can. In a 16-bit word, the least significant byte contributes 0-255, and the most significant byte contributes multiples of 256, from 0 up to 65280.
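The weighting of the two bytes can be written out with shifts, which gives the same answer on any host. This is an illustrative sketch; from_little_endian and from_big_endian are hypothetical names, not functions from the tutorial or the answers.

```cpp
#include <cstdint>

// Hypothetical helpers: combine two bytes into a 16-bit value under an
// explicitly chosen byte order, independent of the host CPU.
inline std::uint16_t from_little_endian(const unsigned char bytes[2])
{
    // bytes[0] is the least significant byte, bytes[1] is weighted by 256.
    return static_cast<std::uint16_t>(bytes[0] | (bytes[1] << 8));
}

inline std::uint16_t from_big_endian(const unsigned char bytes[2])
{
    // bytes[0] is the most significant byte, so it is the one weighted by 256.
    return static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
}
```

For the array {1, 0} from the question, from_little_endian yields 1 and from_big_endian yields 256, matching the two outcomes the tutorial describes.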

In any case, unless you're doing low level systems programming (and I mean like, directly modifying memory) or writing communications protocols, you never ever need to worry about endianness.
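For the communications-protocol case, a common approach is to fix the byte order of the wire format and convert explicitly. The sketch below assumes a format that stores 32-bit values in big-endian order; store_be32 and load_be32 are hypothetical names chosen for this example.

```cpp
#include <cstdint>

// Hypothetical helpers for a wire format that stores 32-bit values in
// big-endian order, regardless of the host's own byte order.
inline void store_be32(std::uint32_t value, unsigned char out[4])
{
    out[0] = static_cast<unsigned char>(value >> 24);  // most significant byte
    out[1] = static_cast<unsigned char>(value >> 16);
    out[2] = static_cast<unsigned char>(value >> 8);
    out[3] = static_cast<unsigned char>(value);        // least significant byte
}

inline std::uint32_t load_be32(const unsigned char in[4])
{
    return (static_cast<std::uint32_t>(in[0]) << 24) |
           (static_cast<std::uint32_t>(in[1]) << 16) |
           (static_cast<std::uint32_t>(in[2]) << 8) |
           static_cast<std::uint32_t>(in[3]);
}
```

Because the code only uses shifts on values, never casts on pointers, the sender and the receiver agree on the bytes even when their processors differ in endianness.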

Answer 4:

unsigned char endian[2] = {1, 0};
short x;

x = *(short *) endian;

This code has undefined behavior. The result could be x set to 1, 256, 4000, or the program could crash or anything else could legally happen. This is the case even without considering whether the array is large enough for the type it's cast to.

Here's a rewrite of the code to make it legal and do what the author intended.

#include <cstring>

unsigned char endian[sizeof(short)] = {1};
short x;
std::memcpy(&x, endian, sizeof(short));

If you were to write code that tried to get an int out of that array then it would access outside the legal array bounds and you would again hit undefined behavior; anything could happen.

In this case, an array with two one-byte elements has been allocated. Why is 1 said to be the most significant byte?

(I'm guessing you mean to ask why endian[1] is said to hold the most significant byte.)

Because in that example the system is little endian and, as you say, the definition of little endian is that the most significant byte is in the memory location with the highest address. endian[1] has a higher address than endian[0], so endian[1] would hold the most significant byte.
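To see which case applies on a given machine, you can copy a known 16-bit value into a byte array and look at where the low byte lands. This is a small sketch using the same memcpy approach as the rewrite above; host_is_little_endian is a hypothetical helper name.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reports whether the host stores the least significant
// byte of a 16-bit value at the lowest address.
inline bool host_is_little_endian()
{
    const std::uint16_t value = 1;  // high byte 0x00, low byte 0x01
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    return bytes[0] == 1;  // low byte stored first means little endian
}
```

On a little-endian host the function returns true and copying {1, 0} into a 16-bit integer gives 1; on a big-endian host it returns false and the same copy gives 256.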

Source: https://stackoverflow.com/questions/12825632/when-endianess-does-matter-cast-operations

Tags

c++

memory

types

byte

endianness