What's the reason for letting the semantics of a=a++ be undefined?

Asked 2020-12-08 07:18
a = a++;

is undefined behaviour in C. The question I am asking is: why?

I mean, I get that it might be hard to provide a

8 Answers
  • 2020-12-08 07:49

    Updating the same object twice without an intervening sequence point is Undefined Behaviour because ...

    • because that makes compiler writers happier
    • because it allows implementations to define it anyway
    • because it doesn't force a specific constraint when it isn't needed
    • ...
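    The distinction above can be illustrated with a minimal C sketch (the undefined form is left in a comment, since compiling and running it would prove nothing):

    ```c
    #include <stdio.h>

    int main(void) {
        int a = 1;

        /* a = a++;   <- undefined: 'a' is written twice with no intervening
           sequence point (C90/C99 wording); in C11 terms, the two side
           effects on 'a' are unsequenced. */

        /* Well-defined alternatives: each full expression ends at a
           sequence point, so the two writes are ordered. */
        a++;          /* a is now 2 */
        a = a + 1;    /* a is now 3 */

        printf("%d\n", a);   /* prints 3 */
        return 0;
    }
    ```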
  • 2020-12-08 07:52

    UPDATE: This question was the subject of my blog on June 18th, 2012. Thanks for the great question!


    Why? I want to know if this was a design decision and if so, what prompted it?

    You are essentially asking for the minutes of the meeting of the ANSI C design committee, and I don't have those handy. If your question can only be answered definitively by someone who was in the room that day, then you're going to have to find someone who was in that room.

    However, I can answer a broader question:

    What are some of the factors that lead a language design committee to leave the behaviour of a legal program (*) "undefined" or "implementation defined" (**)?

    The first major factor is: are there two existing implementations of the language in the marketplace that disagree on the behaviour of a particular program? If FooCorp's compiler compiles M(A(), B()) as "call A, call B, call M", and BarCorp's compiler compiles it as "call B, call A, call M", and neither is the "obviously correct" behaviour, then there is a strong incentive for the language design committee to say "you're both right", and make it implementation defined behaviour. This is particularly the case if FooCorp and BarCorp both have representatives on the committee.

    The next major factor is: does the feature naturally present many different possibilities for implementation? For example, in C# the compiler's analysis of a "query comprehension" expression is specified as "do a syntactic transformation into an equivalent program that does not have query comprehensions, and then analyze that program normally". There is very little freedom for an implementation to do otherwise.

    By contrast, the C# specification says that the foreach loop should be treated as the equivalent while loop inside a try block, but allows the implementation some flexibility. A C# compiler is permitted to say, for example "I know how to implement foreach loop semantics more efficiently over an array" and use the array's indexing feature rather than converting the array to a sequence as the specification suggests it should.

    A third factor is: is the feature so complex that a detailed breakdown of its exact behaviour would be difficult or expensive to specify? The C# specification says very little indeed about how anonymous methods, lambda expressions, expression trees, dynamic calls, iterator blocks and async blocks are to be implemented; it merely describes the desired semantics and some restrictions on behaviour, and leaves the rest up to the implementation.

    A fourth factor is: does the feature impose a high burden on the compiler to analyze? For example, in C# if you have:

    Func<int, int> f1 = (int x)=>x + 1;
    Func<int, int> f2 = (int x)=>x + 1;
    bool b = object.ReferenceEquals(f1, f2);
    

    Suppose we require b to be true. How are you going to determine when two functions are "the same"? Doing an "intensionality" analysis -- do the function bodies have the same content? -- is hard, and doing an "extensionality" analysis -- do the functions have the same results when given the same inputs? -- is even harder. A language specification committee should seek to minimize the number of open research problems that an implementation team has to solve!

    In C# this is therefore left to be implementation-defined; a compiler can choose to make them reference equal or not at its discretion.

    A fifth factor is: does the feature impose a high burden on the runtime environment?

    For example, in C# dereferencing past the end of an array is well-defined; it produces an array-index-was-out-of-bounds exception. This feature can be implemented with a small -- not zero, but small -- cost at runtime. Calling an instance or virtual method with a null receiver is defined as producing a null-was-dereferenced exception; again, this can be implemented with a small, but non-zero cost. The benefit of eliminating the undefined behaviour pays for the small runtime cost.
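    In C, by contrast, an out-of-bounds access is undefined. The following hedged sketch shows, in C, the kind of small compare-and-branch cost a C# implementation pays on every access in order to make the behaviour defined; `checked_get` is a hypothetical helper invented for illustration, not a real API:

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical helper: one compare-and-branch per access is the small,
       non-zero runtime cost of turning an out-of-bounds read into a
       well-defined failure instead of undefined behaviour. */
    int checked_get(const int *arr, size_t len, size_t i) {
        if (i >= len) {
            abort();   /* stand-in for throwing IndexOutOfRangeException */
        }
        return arr[i];
    }

    int main(void) {
        int data[3] = {10, 20, 30};
        assert(checked_get(data, 3, 2) == 30);
        /* checked_get(data, 3, 3) would abort instead of reading past the end */
        return 0;
    }
    ```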

    A sixth factor is: does making the behaviour defined preclude some major optimization? For example, C# defines the ordering of side effects when observed from the thread that causes the side effects. But the behaviour of a program that observes side effects of one thread from another thread is implementation-defined except for a few "special" side effects. (Like a volatile write, or entering a lock.) If the C# language required that all threads observe the same side effects in the same order then we would have to restrict modern processors from doing their jobs efficiently; modern processors depend on out-of-order execution and sophisticated caching strategies to obtain their high level of performance.

    Those are just a few factors that come to mind; there are of course many, many other factors that language design committees debate before making a feature "implementation defined" or "undefined".

    Now let's return to your specific example.

    The C# language does make that behaviour strictly defined (***); the side effect of the increment is observed to happen before the side effect of the assignment. So there cannot be any "well, it's just impossible" argument there, because it is possible to choose a behaviour and stick to it. Nor does this preclude major opportunities for optimizations. And there are not a multiplicity of possible complex implementation strategies.
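    That defined C# ordering can be spelled out as three sequenced steps. Here is a sketch, written in C, of the semantics C# mandates for a = a++; — the steps are written out explicitly, since in C the one-line form itself remains undefined:

    ```c
    #include <assert.h>

    int main(void) {
        int a = 5;

        /* C#'s defined semantics for a = a++; made explicit: */
        int old = a;   /* 1. evaluate a++: its value is the old a (5) */
        a = a + 1;     /* 2. side effect of ++: a becomes 6           */
        a = old;       /* 3. side effect of =:  a becomes 5 again     */

        assert(a == 5);   /* in C#, a == 5 after a = a++; */
        return 0;
    }
    ```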

    My guess, therefore, and I emphasize that this is a guess, is that the C language committee made ordering of side effects into implementation defined behaviour because there were multiple compilers in the marketplace that did it differently, none was clearly "more correct", and the committee was unwilling to tell half of them that they were wrong.


    (*) Or, sometimes, its compiler! But let's ignore that factor.

    (**) "Undefined" behaviour means that the code can do anything, including erasing your hard disk. The compiler is not required to generate code that has any particular behaviour, and not required to tell you that it is generating code with undefined behaviour. "Implementation defined" behaviour means that the compiler author is given considerable freedom in choice of implementation strategy, but is required to pick a strategy, use it consistently, and document that choice.

    (***) When observed from a single thread, of course.

  • 2020-12-08 07:55

    The postfix ++ operator returns the value prior to the incrementation. So, at the first step, a gets assigned its old value (that's what ++ returns). It is then undefined whether the increment or the assignment takes place first, because both operations modify the same object (a), and the language says nothing about the order in which these side effects are applied.
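    When the assignment target is a different object there is no conflict and the behaviour is fully defined, which isolates what the postfix operator itself does:

    ```c
    #include <assert.h>

    int main(void) {
        int a = 5;
        int b = a++;     /* well-defined: b and a are distinct objects */

        assert(b == 5);  /* postfix ++ yields the value before the increment */
        assert(a == 6);  /* ...and increments a as a side effect */
        return 0;
    }
    ```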

  • 2020-12-08 07:56

    It's ambiguous but not syntactically wrong. What should a be? Both = and ++ have the same "timing." So instead of defining an arbitrary order, it was left undefined, since either order would conflict with one of the two operators' definitions.

  • 2020-12-08 07:59

    Somebody may provide another reason, but from an optimization (or rather, code-generation) point of view: a needs to be loaded into a CPU register, and the value produced by the postfix operator may be placed either in another register or in the same one. So the result of the final assignment can depend on whether the optimizer uses one register or two.

  • 2020-12-08 07:59

    Suppose a is a pointer with value 0x0001ffff. And suppose the architecture is segmented so that the compiler needs to apply the increment to the high and low parts separately, with a carry between them. The optimiser could conceivably reorder the writes so that the final value stored is 0x0002ffff; that is, the low part before the increment and the high part after the increment.

    This value is twice either value that you might have expected. It may point to memory not owned by the application, or it may (in general) be a trapping representation. In other words, the CPU may raise a hardware fault as soon as this value is loaded into a register, crashing the app. Even if it doesn't cause an immediate crash, it is a profoundly wrong value for the app to be using.

    The same kind of thing can happen with other basic types, and the C language allows even ints to have trapping representations. C tries to allow efficient implementation on a wide range of hardware. Getting efficient code on a segmented machine such as the 8086 is hard. By making this undefined behaviour, a language implementor has a bit more freedom to optimise aggressively. I don't know if it has ever made a performance difference in practice, but evidently the language committee wanted to give every benefit to the optimiser.
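    The segmented-pointer scenario can be reproduced with plain integer arithmetic on the two 16-bit halves. This is a sketch; real segment:offset arithmetic on the 8086 is more involved:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Increment a 32-bit value via its 16-bit halves, as a compiler for a
       segmented architecture might emit it. */
    int main(void) {
        uint32_t p = 0x0001ffffu;
        uint16_t low  = (uint16_t)(p & 0xffffu);   /* 0xffff */
        uint16_t high = (uint16_t)(p >> 16);       /* 0x0001 */

        /* Correct order: increment the low part, carry into the high part. */
        uint16_t new_low  = (uint16_t)(low + 1);                         /* wraps to 0x0000 */
        uint16_t new_high = (uint16_t)(high + (low == 0xffffu ? 1 : 0)); /* 0x0002 */
        assert((((uint32_t)new_high << 16) | new_low) == 0x00020000u);

        /* Reordered writes as described above: the old low half paired with
           the new high half yields the wrong value 0x0002ffff. */
        assert((((uint32_t)new_high << 16) | low) == 0x0002ffffu);
        return 0;
    }
    ```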
