Why is tailcall optimization not performed for types of class MEMORY?

问题

I'm trying to understand the implication of System V AMD64 - ABI for returning by value from a function.

For the following data type

struct Vec3{
    double x, y, z;
};

the type Vec3 is of class MEMORY and thus the following is specified by the ABI concerning "Returning of Values":

If the type has class MEMORY, then the caller provides space for the return value and passes the address of this storage in %rdi as if it were the first argument to the function. In effect, this address becomes a “hidden” first argument. This storage must not overlap any data visible to the callee through other names than this argument.

On return %rax will contain the address that has been passed in by the caller in %rdi.

With this in mind, the following (silly) function:

struct Vec3 create(void);

struct Vec3 use(){
    return create();
}

could be compiled as:

use_v2:
        jmp     create

In my opinion, tailcall-optimization can be performed, as we are assured by the ABI, that create will place in %rdi passed value into %rax register.

However, none of the compilers (gcc, clang, icc) seem to be performing this optimization (here on godbolt). The resulting assembly code saves %rdi on stack only to be able move its value to %rax, for example gcc:

use:
        pushq   %r12
        movq    %rdi, %r12
        call    create
        movq    %r12, %rax
        popq    %r12
        ret

Neither for this minimal, silly function nor for more complicated ones from real life, tailcall-optimization is performed. Which leads me to believe, that I must be missing something, which prohibits it.

Needless to say, for types of class SSE (e.g. only 2 and not 3 doubles), tailcall-optimization is performed (at least by gcc and clang, live on godbolt):

struct Vec2{
    double x, y;
};

struct Vec2 create(void);

struct Vec2 use(){
    return create();
}

results in

use:
        jmp     create

回答1:

Looks like a missed optimization bug that you should report, if there isn't already a duplicate open for gcc and clang.

(It's not rare for both gcc and clang to have the same missed optimization in cases like this; do not assume that something is illegal just because compilers don't do it. The only useful data is when compilers do perform an optimization: it's either a compiler bug or at least some compiler devs decided it was safe according to their interpretation of whatever standards.)

We can see GCC is returning its own incoming arg instead of returning the copy of it that create() will return in RAX. This is the missed optimization that's blocking tailcall optimization.

The ABI requires a function with a MEMORY-type return value to return the "hidden" pointer in RAX¹.

GCC/clang do already realize they can elide actual copying by passing along their own return-value space, instead of allocating fresh space. But to do tailcall optimization, they'd have to realize that they can leave their callee's RAX value in RAX, instead of saving their incoming RDI in a call-preserved register.

If the ABI didn't require returning the hidden pointer in RAX, I expect gcc/clang would have had no problem with passing along the incoming RDI as part of an optimized tailcall.

Generally compilers like to shorten dependency chains; that's probably what's going on here. The compiler doesn't know that the latency from rdi arg to rax result of create() is probably just one mov instruction. Ironically, this could be a pessimization if the callee saves/restores some call-preserved registers (like r12), introducing a store/reload of the return-address pointer. (But that mostly only matters if anything even uses it. I did get some clang code to do so, see below.)

Footnote 1: Returning the pointer sounds like a good idea, but almost invariably the caller already knows where it put the arg in its own stack frame and will just use an addressing mode like 8(%rsp) instead of actually using RAX. At least in compiler-generated code, the RAX return value will typically go unused. (And if necessary, the caller can always save it somewhere themselves.)

As discussed in What prevents the usage of a function argument as hidden pointer? there are serious obstacles to using anything other than space in the caller's stack frame to receive a retval.

Having the pointer in a register just saves an LEA in the caller if the caller wants to store the address somewhere, if it is a static or stack address.

However, this case is close to one where it would be useful. If we're passing along our own retval space to a child function, we might want to modify that space after the call. Then it is useful for easy access to that space, e.g. to modify a return value before we return.

#define T struct Vec3

T use2(){
    T tmp = create();
    tmp.y = 0.0;
    return tmp;
}

Efficient handwritten asm:

use2:
        callq   create
        movq    $0, 8(%rax)
        retq

Actual clang asm at least still uses return-value optimization, vs. GCC9.1 copying. (Godbolt)

# clang -O3
use2:                                   # @use2
        pushq   %rbx
        movq    %rdi, %rbx
        callq   create
        movq    $0, 8(%rbx)
        movq    %rbx, %rax
        popq    %rbx
        retq

This ABI rule perhaps exists specifically for this case, or maybe the ABI designers were picturing that the retval space might be newly-allocated dynamic storage (which the caller would have to save a pointer to if the ABI didn't provide it in RAX). I didn't try that case.

回答2:

System V AMD64 - ABI will return data from a function in registers RDX and RAX or XMM0 and XMM1. Looking at Godbolt the optimization seems to be based on size. The compiler will only return up to 2 double or 4 float in registers.

Compilers miss optimizations all the time. The C language does not have tail-call optimization, unlike Scheme. GCC and Clang have said that they have no plans to try and guarantee tail-call optimization. It sounds like OP could try asking the compiler developers or open a bug with said compilers.

来源：https://stackoverflow.com/questions/57595888/why-is-tailcall-optimization-not-performed-for-types-of-class-memory

标签

gcc

optimization

x86-64

calling-convention