ASCII strings and endianness

Submitted by 我与影子孤独终老i on 2019-11-28 15:58:57

Without a doubt, you are correct.

ANSI C standard 6.1.4 specifies that string literals are stored in memory by "concatenating" the characters in the literal.

ANSI standard 6.3.6 also specifies the effect of addition on a pointer value:

When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression.

If the idea attributed to this person were correct, then the compiler would also have to monkey around with integer math when the integers are used as array indices. Many other fallacies would also result which are left to the imagination.

The person may be confused because, unlike a string literal, a multi-character constant such as 'ABCD' has type int. Its value is implementation-defined, and its bytes are stored in the machine's byte order like any other int.

There are many reasons a person might be confused about this. As others have suggested here, he may be misreading what he sees in a debugger window, where the contents have been byte-swapped for readability of int values.

The professor is confused. In order to see something like 'P-yM azzi' you need to take some memory inspection tool that displays memory in '4-byte integer' mode and at the same time gives you a "character interpretation" of each integer in higher-order byte to lower-order byte mode.

This, of course, has nothing to do with the string itself. And to say that the string itself is represented that way on a little-endian machine is utter nonsense.

The professor is wrong if we're talking about a system that uses 8 bits per character.

I often work with embedded systems that actually use 16-bit characters, each word being little-endian. On such a system, the string "My-Pizza" would indeed be stored as "yMP-ziaz".

But as long as it's an 8-bit-per-character system, the string will always be stored as "My-Pizza" independent of the endian-ness of the higher-level architecture.

Endianness defines the order of bytes within multi-byte values. Character strings are arrays of single-byte values. So each value (character in the string) is the same on both little-endian and big-endian architectures, and endianness does not affect the order of values in a structure.

You can quite easily prove that the compiler is doing no such "magic" transformations, by doing the printing in a function that doesn't know it's been passed a string:

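#include <stdio.h>
#include <string.h>

/* Print each byte with its address; foo has no idea it was passed a string. */
void foo(const void *mem, size_t n)
{
    const char *cptr, *end;
    for (cptr = mem, end = cptr + n; cptr < end; cptr++)
        printf("%p : %c\n", (const void *)cptr, *cptr);
}

int main(void)
{
    const char *s = "My-Pizza";

    foo(s, strlen(s));
    foo(s + 1, strlen(s) - 1);
    return 0;
}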

Alternatively, you can even compile to assembly with gcc -S and conclusively determine the absence of magic.

But shockingly, the intern claims his professor insists that the string would be represented as:

P-yM azzi

Represented as what, exactly? Presented to the user as a 32-bit integer dump, or actually laid out in the computer's memory as P-yM azzi?

If the professor said that "My-Pizza" would be laid out in the computer's memory as "P-yM azzi" because the computer has a little-endian architecture, then somebody, please, has got to teach that professor how to use a debugger! I think that's where all of the professor's confusion stems from. I have an inkling that the professor is not a coder (not that I'm looking down on the professor), and I think he doesn't have a way to prove in code what he learned about endianness.

Maybe the professor learned about endianness just a week ago, used a debugger incorrectly, was delighted by his new insight into computers, and then immediately preached it to his students.

If the professor said that the endianness of the machine has a bearing on how ASCII strings are represented in memory, he needs to clean up his act, and somebody should correct him.

If the professor had instead given an example of how integers are laid out differently depending on the machine's endianness, his students could appreciate what he is teaching.

I assume the professor was trying to make a point by analogy about the endian/NUXI problem, but you're right when you apply it to actual strings. Don't let that distract from the fact that he was trying to teach students a point and how to think about a problem in a certain way.

You may be interested to know that it is possible to emulate a little-endian architecture on a big-endian machine, or vice versa. The compiler has to emit code which auto-magically messes with the least significant bits of char* pointers whenever it dereferences them: on a 32-bit machine you'd map 00 <-> 11 and 01 <-> 10.

So, if you write the number 0x01020304 on a big-endian machine, and read back the "first" byte of that with this address-munging, then you get the least significant byte, 0x04. The C implementation is little-endian even though the hardware is big-endian.

You need a similar trick for short accesses. Unaligned accesses (if supported) may not refer to adjacent bytes. You also can't use native stores for types bigger than a word because they'd appear word-swapped when read back one byte at a time.

Obviously, however, little-endian machines do not do this all the time. It's a very specialist requirement, and it prevents you from using the native ABI. It sounds to me as though the professor thinks of actual numbers as being "in fact" big-endian, and is deeply confused about what a little-endian architecture really is and/or how its memory is being represented.

It's true that the string is "represented as" P-yM azzi on 32bit l-e machines, but only if by "represented" you mean "reading the words of the representation in order of increasing address, but printing the bytes of each word big-endian". As others have said, this is what some debugger memory views might do, so it is indeed a representation of the contents of the memory. But if you're going to represent the individual bytes, then it is more usual to list them in order of increasing address, no matter whether words are stored b-e or l-e, rather than represent each word as a multi-char literal. Certainly there is no pointer-fiddling going on, and if the professor's chosen representation has led him to think that there is some, then it has misled him.

Also (and I haven't played with this in a long time, so I might be wrong), he might be thinking of Pascal, where strings are represented as "packed arrays" which, IIRC, are characters packed into 4-byte integers?

AFAIK, endianness only makes sense when you want to break a large value into small ones. Therefore I don't think that C-style strings are affected by it, because they are, after all, just arrays of characters. When you are reading only one byte, how could it matter if you read it from left or right?

It's hard to read the prof's mind and certainly the compiler is not doing anything other than storing bytes to adjacent increasing addresses on both BE and LE systems, but it is normal to display memory in word-sized numbers, for whatever the word size is, and we write one thousand as 1,000. Not 000,1.

$ cat > /tmp/pizza
My-Pizza^D
$ od -X /tmp/pizza
0000000 502d794d 617a7a69
0000010
$ 

For the record, y == 0x79, M == 0x4d.

I came across this and felt the need to clear it up. No one here seems to have addressed the concept of bytes and words or how to address them. A byte is 8 bits. A word is a collection of bytes.

If the computer is:

  • byte addressable
  • with 4-byte (32-bit) words
  • word aligned
  • the memory is viewed "physically" (not dumped and byte-swapped)

then indeed, the professor would be correct. His failure to indicate this proves he doesn't exactly know what he is talking about, but he did understand the basic concept.

[Figure: Byte Order Within Words: (a) Big Endian, (b) Little Endian]

[Figure: Character and Integer Data in Words: (a) Big Endian, (b) Little Endian]

Does the professor's "C" code look anything like this? If so, he needs to update his compiler.

main() {
    extrn putchar;
    putchar('Hell');
    putchar('o, W');
    putchar('orld');
    putchar('!*n');
}