How does mtune actually work?

时光取名叫无心 2021-01-04 08:24

There's this related question: GCC: how is march different from mtune?

However, the existing answers don't go much further than the GCC manual itself. At most, we

1 Answer
  • 2021-01-04 08:49

    -mtune doesn't create a dispatcher, and it doesn't need one: we are already telling the compiler which architecture we are targeting.

    From the GCC docs:

    -mtune=cpu-type

            Tune to cpu-type everything applicable about the generated code, except for the ABI and the
            set of available instructions.

    This means that GCC won't use instructions available only on cpu-type¹, but it will generate code that runs optimally on cpu-type.

    To understand this last statement, it is necessary to understand the difference between architecture and micro-architecture.
    The architecture implies an ISA (Instruction Set Architecture), and that is not influenced by -mtune.
    The micro-architecture is how the architecture is implemented in hardware. For the same instruction set (read: architecture), a code sequence may run optimally on one CPU (read: micro-architecture) but not on another, due to the internal details of the implementation. This can go as far as a code sequence being optimal on only one micro-architecture.

    When generating machine code, GCC often has a degree of freedom in choosing how to order the instructions and which variant to use.
    It uses heuristics to generate a sequence of instructions that runs fast on the most common CPUs; sometimes it will sacrifice a 100% optimal solution for CPU x if that would penalise CPUs y, z and w.

    When we use -mtune=x, we are fine-tuning the output of GCC for CPU x, thereby producing code that is 100% optimal (from the GCC perspective) on that CPU.

    As a concrete example consider how this code is compiled:

    float bar(float a[4], float b[4])
    {
        for (int i = 0; i < 4; i++)
        {
            a[i] += b[i];
        }
    
        float r=0;
    
        for (int i = 0; i < 4; i++)
        {
            r += a[i];
        }
    
        return r;
    } 
    

    The a[i] += b[i]; is vectorised (if the vectors don't overlap) differently when targeting a Skylake or a Core2:

    Skylake

        movups  xmm0, XMMWORD PTR [rsi]
        movups  xmm2, XMMWORD PTR [rdi]
        addps   xmm0, xmm2
        movups  XMMWORD PTR [rdi], xmm0
        movss   xmm0, DWORD PTR [rdi] 
    

    Core2

        pxor    xmm0, xmm0
        pxor    xmm1, xmm1
        movlps  xmm0, QWORD PTR [rdi]
        movlps  xmm1, QWORD PTR [rsi]
        movhps  xmm1, QWORD PTR [rsi+8]
        movhps  xmm0, QWORD PTR [rdi+8]
        addps   xmm0, xmm1
        movlps  QWORD PTR [rdi], xmm0
        movhps  QWORD PTR [rdi+8], xmm0
        movss   xmm0, DWORD PTR [rdi]
    

    The main difference is how an xmm register is loaded: on a Core2 it is loaded with two loads, using movlps and movhps, instead of a single movups.
    The two-load approach is better on the Core2 micro-architecture; if you take a look at Agner Fog's instruction tables, you'll see that movups is decoded into 4 uops and has a latency of 2 cycles, while each movXps is 1 uop with 1 cycle of latency.
    This is probably because 128-bit memory accesses were split into two 64-bit accesses on that generation.
    On Skylake the opposite is true: a single movups performs better than two movXps.

    So GCC has to pick one.
    In general, it picks the first (single-movups) variant, because Core2 is an old micro-architecture, but we can override this with -mtune.


    ¹ The instruction set is selected with other switches.
