How can a file contain null bytes?

一整个雨季 2021-02-03 23:17

How is it possible that files can contain null bytes in operating systems written in a language with null-terminating strings (namely, C)?

For example, if I run this she

6 answers
  •  情深已故
    2021-02-04 00:10

    Before answering anything, please note that

    (Note: according to n.m. (see the comments on the OP), "a Byte is the smallest quantity available to write out to disk with the C standard library, non-standard libraries may well deal with bits or anything else." So what I said below about WORD sizes being the smallest quantity is probably not very true, but it still provides some insight.)

    NULL is always 0_decimal (practically)

    dec: 0
    hex: 0x00000000
    bin: 00000000 00000000 00000000 00000000
    

    although its actual value is defined by a programming language's specification, so use the defined constant NULL instead of hardcoding 0 everywhere (in case it changes, when hell freezes over).

    ASCII encoding for character '0' is 48_decimal

    dec: 48
    hex: 0x00000030
    bin: 00000000 00000000 00000000 00110000
    

    The concept of NULL doesn't exist in a file, but within the generating app's programming language. Just the numeric encoding/value of NULL exists in a file.

    How is it possible that files can contain null bytes in operating systems written in a language with null-terminating strings (namely, C)?

    With the above stated this question becomes, how can a file contain 0? The answer is now trivial.

    For example, if I run this shell code:

    $ printf "Hello\00, World!" > test.txt
    $ xxd test.txt
    0000000: 4865 6c6c 6f00 2c20 576f 726c 6421       Hello., World!
    

    I see a null byte in test.txt (at least in OS X). If C uses null-terminating strings, and OS X is written in C, then how come the file isn't terminated at the null byte, resulting in the file containing Hello instead of Hello\00, World!?

    Is there a fundamental difference between files and strings?

    Assuming an ASCII character encoding (1-byte/8-bit characters in the decimal range of 0 and 127):

    • Strings are buffers/char-arrays of 1-byte characters (where NULL = 0_decimal and '0' = 48_decimal).
    • Files are sequences of either 32-bit or 64-bit "WORDS" (depends on OS and hardware, i.e. x86 or x64 respectively).

    Therefore, a 32-bit OS file that contains only ASCII strings will be a sequence of 32-bit (4-byte) words that range between the decimal values 0 and 127, essentially using only the first byte of the 4-byte word (b2: base-2, decimal is base-10 and hex base-16, fyi)

      0_b2: 00000000 00000000 00000000 00000000
     32_b2: 00000000 00000000 00000000 00100000
     64_b2: 00000000 00000000 00000000 01000000
     96_b2: 00000000 00000000 00000000 01100000
    127_b2: 00000000 00000000 00000000 01111111
    128_b2: 00000000 00000000 00000000 10000000
    

    Whether this byte is left-most or right-most depends on the OS's endianness.

    But to answer your question about the missing NULL after Hello\00, World!, I'm going to assume that it was substituted by the EOL/EOF (end of file) value, which is most likely non-printable and is why you're not seeing it in the output window.

    Note: I'm sure modern OSes (and classic Unix-based systems) optimize the storage of ASCII characters, so that 1 word (4 bytes) can pack in 4 characters. Things change with UTF, however, since these encodings use more bits to store characters because they have larger alphabets/character sets to represent (like 50k Kanji/Japanese characters). I think UTF-8 is analogous to ASCII, and renamed for uniformity (with UTF-16 and UTF-32).

    Note: C/C++ does in fact "pack" 4 characters into a single 4-byte word using character arrays (i.e., strings). Since each char is 1 byte, the compiler will allocate and treat it as 1 byte, arithmetically, on the stack or heap. So if you declare an array in a function (i.e., an auto variable), like so

    char str1[7] = {'H','e','l','l','o','!','\0'};
    

    where the function stack begins at address 1000_b10 (base-10/decimal), then ya have:

    072 101 108 108 111 033
    
    addr  char        binary   decimal
    ----  ----------- -------- -------
    1000: str1[0] 'H'  01001000 (072)
    1001: str1[1] 'e'  01100101 (101)
    1002: str1[2] 'l'  01101100 (108)
    1003: str1[3] 'l'  01101100 (108)
    1004: str1[4] 'o'  01101111 (111)
    1005: str1[5] '!'  00100001 (033)
    1006: str1[6] '\0' 00000000 (000)
    

    Since RAM is byte-addressable, every address references a single byte.
