Example of error caused by UB of incrementing a NULL pointer

后端 未结 4 1774
滥情空心
滥情空心 2020-12-10 07:05

This code :

int *p = nullptr;
p++;

cause undefined behaviour as it was discussed in Is incrementing a null pointer well-defined?

Bu

相关标签:
4条回答
  • 2020-12-10 07:43

    An ideal C implementation would, when not being used for kinds of systems programming that would require using pointers which the programmer knew to have meaning but the compiler did not, ensure that every pointer was either valid or was recognizable as invalid, and would trap any time code either tried to dereference an invalid pointer (including null) or used illegitimate means to created something that wasn't a valid pointer but might be mistaken for one. On most platforms, having generated code enforce such a constraint in all situations would be quite expensive, but guarding against many common erroneous scenarios is much cheaper.

    On many platforms, it is relatively inexpensive to have the compiler generate for *foo=23 code equivalent to if (!foo) NULL_POINTER_TRAP(); else *foo=23;. Even primitive compilers in the 1980s often had an option for that. The usefulness of such trapping may be largely lost, however, if compilers allow a null pointer to be incremented in such a fashion that it is no longer recognizable as a null pointer. Consequently, a good compiler should, when error-trapping is enabled, replace foo++; with foo = (foo ? foo+1 : (NULL_POINTER_TRAP(),0));. Arguably, the real "billion dollar mistake" wasn't inventing null pointers, but lay rather with the fact that some compilers would trap direct null-pointer stores, but would not trap null-pointer arithmetic.

    Given that an ideal compiler would trap on an attempt to increment a null pointer (many compilers fail to do so for reasons of performance rather than semantics), I can see no reason why code should expect such an increment to have meaning. In just about any case where a programmer might expect a compiler to assign a meaning to such a construct [e.g. ((char*)0)+5 yielding a pointer to address 5], it would be better for the programmer to instead use some other construct to form the desired pointer (e.g. ((char*)5)).

    0 讨论(0)
  • 2020-12-10 07:53

    This is just for completion, but the link proposed by @HansPassant in comment really deserves to be cited as an answer.

    All references are here, following is just some extracts

    This article is about a new memory-safe interpretation of the C abstract machine that provides stronger protection to benefit security and debugging ... [Writers] demonstrate that it is possible for a memory-safe implementation of C to support not just the C abstract machine as specified, but a broader interpretation that is still compatible with existing code. By enforcing the model in hardware, our implementation provides memory safety that can be used to provide high-level security properties for C ...

    [Implementation] memory capabilities are represented as the triplet (base, bound, permissions), which is loosely packed into a 256-bit value. Here base provides an offset into a virtual address region, and bound limits the size of the region accessed ... Special capability load and store instructions allow capabilities to be spilled to the stack or stored in data structures, just like pointers ... with the caveat that pointer subtraction is not allowed.

    The addition of permissions allows capabilities to be tokens granting certain rights to the referenced memory. For example, a memory capability may have permissions to read data and capabilities, but not to write them (or just to write data but not capabilities). Attempting any of the operations that is not permitted will cause a trap.

    [The] results confirm that it is possible to retain the strong semantics of a capability-system memory model (which provides non-bypassable memory protection) without sacrificing the advantages of a low-level language.

    (emphasize mine)

    That means that even if it is not an operational compiler, researches exists to build one that could trap on incorrect pointer usages, and have already been published.

    0 讨论(0)
  • 2020-12-10 08:06

    Extracted from http://c-faq.com/null/machexamp.html:

    Q: Seriously, have any actual machines really used nonzero null pointers, or different representations for pointers to different types?

    A: The Prime 50 series used segment 07777, offset 0 for the null pointer, at least for PL/I. Later models used segment 0, offset 0 for null pointers in C, necessitating new instructions such as TCNP (Test C Null Pointer), evidently as a sop to [footnote] all the extant poorly-written C code which made incorrect assumptions. Older, word-addressed Prime machines were also notorious for requiring larger byte pointers (char *'s) than word pointers (int *'s).

    The Eclipse MV series from Data General has three architecturally supported pointer formats (word, byte, and bit pointers), two of which are used by C compilers: byte pointers for char * and void *, and word pointers for everything else. For historical reasons during the evolution of the 32-bit MV line from the 16-bit Nova line, word pointers and byte pointers had the offset, indirection, and ring protection bits in different places in the word. Passing a mismatched pointer format to a function resulted in protection faults. Eventually, the MV C compiler added many compatibility options to try to deal with code that had pointer type mismatch errors.

    Some Honeywell-Bull mainframes use the bit pattern 06000 for (internal) null pointers.

    The CDC Cyber 180 Series has 48-bit pointers consisting of a ring, segment, and offset. Most users (in ring 11) have null pointers of 0xB00000000000. It was common on old CDC ones-complement machines to use an all-one-bits word as a special flag for all kinds of data, including invalid addresses.

    The old HP 3000 series uses a different addressing scheme for byte addresses than for word addresses; like several of the machines above it therefore uses different representations for char * and void * pointers than for other pointers.

    The Symbolics Lisp Machine, a tagged architecture, does not even have conventional numeric pointers; it uses the pair <NIL, 0> (basically a nonexistent <object, offset> handle) as a C null pointer.

    Depending on the ``memory model'' in use, 8086-family processors (PC compatibles) may use 16-bit data pointers and 32-bit function pointers, or vice versa.

    Some 64-bit Cray machines represent int * in the lower 48 bits of a word; char * additionally uses some of the upper 16 bits to indicate a byte address within a word.

    Given that those null pointers have a weird bit pattern representation in the quoted machines, the code you put:

    int *p = nullptr;
    p++;
    

    would not give the value most people would expect (0 + sizeof(*p)).

    Instead you would have a value based on your machine specific nullptr bit pattern (except if the compiler has a special case for null pointer arithmetic but since that is not mandated by the standard you'll most likely face Undefined Behaviour with "visible" concrete effect).

    0 讨论(0)
  • 2020-12-10 08:09

    How about this example:

    int main(int argc, char* argv[])
    {
        int a[] = { 111, 222 };
    
        int *p = (argc > 1) ? &a[0] : nullptr;
        p++;
        p--;
    
        return (p == nullptr);
    }
    

    At face value, this code says: 'If there are any command line arguments, initialise p to point to the first member of a[], otherwise initialise it to null. Then increment it, then decrement it, and tell me if it's null.'

    On the face of it this should return '0' (indicating p is non-null) if we supply a command line argument, and '1' (indicating null) if we don't. Note that at no point do we dereference p, and if we supply an argument then p always points within the bounds of a[].

    Compiling with the command line clang -S --std=c++11 -O2 nulltest.cpp (Cygwin clang 3.5.1) yields the following generated code:

        .text
        .def     main;
        .scl    2;
        .type   32;
        .endef
        .globl  main
        .align  16, 0x90
    main:                                   # @main
    .Ltmp0:
    .seh_proc main
    # BB#0:
        pushq   %rbp
    .Ltmp1:
        .seh_pushreg 5
        movq    %rsp, %rbp
    .Ltmp2:
        .seh_setframe 5, 0
    .Ltmp3:
        .seh_endprologue
        callq   __main
        xorl    %eax, %eax
        popq    %rbp
        retq
    .Leh_func_end0:
    .Ltmp4:
        .seh_endproc
    

    This code says 'return 0'. It doesn't even bother to check the number of command line args.

    (And interestingly, commenting out the decrement has no effect on the generated code.)

    0 讨论(0)
提交回复
热议问题