How does mtune actually work?


-mtune doesn't create a dispatcher, and it doesn't need one: we are already telling the compiler which architecture we are targeting.

From the GCC docs:

-mtune=cpu-type

        Tune to cpu-type everything applicable about the generated code, except for the ABI and the
        set of available instructions.

This means that GCC won't use instructions available only on cpu-type 1, but it will generate code that runs optimally on cpu-type.

To understand this last statement, it is necessary to understand the difference between architecture and micro-architecture.
The architecture implies an ISA (Instruction Set Architecture), and that is not influenced by -mtune.
The micro-architecture is how the architecture is implemented in hardware. For an equal instruction set (read: architecture), a code sequence may run optimally on one CPU (read: micro-architecture) but not on another, due to the internal details of the implementation. This can go as far as a code sequence being optimal on only one micro-architecture.

When generating machine code, GCC often has a degree of freedom in choosing how to order the instructions and which variant to use.
By default it uses a heuristic to generate a sequence of instructions that runs fast on the most common CPUs; sometimes it will sacrifice the 100% optimal solution for CPU x if that solution would penalise CPUs y, z and w.

When we use -mtune=x we are fine-tuning the output of GCC for CPU x, thereby producing code that is 100% optimal (from GCC's perspective) on that CPU.
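For instance, the two kinds of switches can be combined: -march fixes the instruction set, while -mtune only changes how that instruction set is used. A hedged sketch of what this looks like on the command line (foo.c is a placeholder, not part of the original example):

    # -march selects the ISA (and implies a matching -mtune unless overridden);
    # -mtune changes only instruction selection and scheduling within that ISA.
    gcc -O2 -march=core2 -mtune=skylake foo.c -o foo   # Core2 instructions, tuned for Skylake
    gcc -O2 -march=core2 foo.c -o foo                  # Core2 instructions, tuned for Core2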

As a concrete example consider how this code is compiled:

float bar(float a[4], float b[4])
{
    for (int i = 0; i < 4; i++)
    {
        a[i] += b[i];
    }

    float r=0;

    for (int i = 0; i < 4; i++)
    {
        r += a[i];
    }

    return r;
} 
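The two listings below can be reproduced with commands along these lines (illustrative flags, not necessarily the exact invocation used here; the precise output depends on the GCC version):

    # Emit Intel-syntax assembly, tuned for each micro-architecture in turn.
    gcc -O3 -S -masm=intel -mtune=skylake bar.c -o bar_skylake.s
    gcc -O3 -S -masm=intel -mtune=core2   bar.c -o bar_core2.s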

The a[i] += b[i]; loop is vectorised (if the vectors don't overlap) differently when tuning for a Skylake or for a Core2:

Skylake

    movups  xmm0, XMMWORD PTR [rsi]
    movups  xmm2, XMMWORD PTR [rdi]
    addps   xmm0, xmm2
    movups  XMMWORD PTR [rdi], xmm0
    movss   xmm0, DWORD PTR [rdi] 

Core2

    pxor    xmm0, xmm0
    pxor    xmm1, xmm1
    movlps  xmm0, QWORD PTR [rdi]
    movlps  xmm1, QWORD PTR [rsi]
    movhps  xmm1, QWORD PTR [rsi+8]
    movhps  xmm0, QWORD PTR [rdi+8]
    addps   xmm0, xmm1
    movlps  QWORD PTR [rdi], xmm0
    movhps  QWORD PTR [rdi+8], xmm0
    movss   xmm0, DWORD PTR [rdi]

The main difference is how an xmm register is loaded: on a Core2 it is loaded with two loads, movlps and movhps, instead of a single movups.
The two-load approach is better on the Core2 micro-architecture; if you take a look at Agner Fog's instruction tables you'll see that movups is decoded into 4 uops and has a latency of 2 cycles, while each movXps is 1 uop with a latency of 1 cycle.
This is probably due to the fact that 128-bit accesses were split into two 64-bit accesses at the time.
On Skylake the opposite is true: movups performs better than two movXps.

So one of the two variants has to be picked.
In general, GCC picks the first variant (the single-movups one) because Core2 is an old micro-architecture, but we can override this choice with -mtune.
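If you want to check what your GCC build tunes for by default (typically generic on x86-64) and then override it, something along these lines works (illustrative commands):

    # Print the -mtune value this GCC build defaults to:
    gcc -Q --help=target | grep mtune
    # Override the default tuning without touching the instruction set:
    gcc -O3 -mtune=core2 -S -masm=intel bar.c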


1 The instruction set is selected with other switches (e.g. -march).
