GCC SSE code optimization

说谎 2020-12-08 05:45

This post is closely related to another one I posted some days ago. This time, I wrote a simple code that just adds a pair of arrays of elements, multiplies the result by th

2 Answers
  • 2020-12-08 06:02

    I would like to extend chill's answer and draw your attention to the fact that GCC does not seem able to make the same smart use of the AVX instructions when iterating backwards.

    Just replace the inner loop in chill's sample code with:

    for (i = N-1; i >= 0; --i)
        r[i] = (a[i] + b[i]) * c[i];
    

    GCC (4.8.4) with options -S -O3 -mavx produces scalar code (vmovsd, vaddsd, vmulsd) that handles one element per iteration:

    .L5:
        vmovsd  a+79992(%rax), %xmm0
        subq    $8, %rax
        vaddsd  b+80000(%rax), %xmm0, %xmm0
        vmulsd  c+80000(%rax), %xmm0, %xmm0
        vmovsd  %xmm0, r+80000(%rax)
        cmpq    $-80000, %rax
        jne     .L5
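
    Since no iteration reads a value written by another one, the descending and ascending loops compute the same r. If the iteration order does not matter to the rest of the program (an assumption here), one possible workaround is simply to write the loop in ascending order. The sketch below checks that equivalence; fill_inputs and loops_agree are hypothetical helpers added for illustration, and the alignment attributes from chill's sample are left out for portability.

```c
#include <stddef.h>

#define N 10000

/* Four global arrays of doubles, as in chill's sample (alignment omitted). */
double a[N], b[N], c[N], r[N];

/* Scratch copy of r, used only to compare the two loop directions. */
static double saved[N];

/* Descending loop, as in the snippet above. */
void compute_backward(void)
{
    int i;
    for (i = N - 1; i >= 0; --i)
        r[i] = (a[i] + b[i]) * c[i];
}

/* Ascending loop: the form GCC vectorized in chill's answer. */
void compute_forward(void)
{
    int i;
    for (i = 0; i < N; ++i)
        r[i] = (a[i] + b[i]) * c[i];
}

/* Hypothetical helper: fill the inputs with simple, exactly representable values. */
void fill_inputs(void)
{
    int i;
    for (i = 0; i < N; ++i) {
        a[i] = (double)i;
        b[i] = 2.0 * (double)i;
        c[i] = 0.5;
    }
}

/* Hypothetical helper: returns 1 if both loop directions produce identical r. */
int loops_agree(void)
{
    int i;
    fill_inputs();
    compute_backward();
    for (i = 0; i < N; ++i)
        saved[i] = r[i];
    compute_forward();
    for (i = 0; i < N; ++i)
        if (saved[i] != r[i])
            return 0;
    return 1;
}
```

    Whether newer GCC releases still emit scalar code for the descending loop is not something this sketch tests; comparing the -S output of compute_backward and compute_forward is the way to find out.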
    
  • 2020-12-08 06:13

    In these GCC 4.x versions, auto-vectorization is enabled at -O3 (it can also be requested explicitly with -ftree-vectorize). That's why at -O0 you see only the ordinary scalar SSE2 instructions (movsd, addsd, etc.). Using GCC 4.6.1 and your second example:

    #define N 10000
    #define NTIMES 100000
    
    double a[N] __attribute__ ((aligned (16)));
    double b[N] __attribute__ ((aligned (16)));
    double c[N] __attribute__ ((aligned (16)));
    double r[N] __attribute__ ((aligned (16)));
    
    int
    main (void)
    {
      int i, times;
      for (times = 0; times < NTIMES; times++)
        {
          for (i = 0; i < N; ++i)
            r[i] = (a[i] + b[i]) * c[i];
        }
    
      return 0;
    }
    

    and compiling with gcc -S -O3 -msse2 sse.c produces the following instructions for the inner loop, which is pretty good:

    .L3:
        movapd  a(%eax), %xmm0
        addpd   b(%eax), %xmm0
        mulpd   c(%eax), %xmm0
        movapd  %xmm0, r(%eax)
        addl    $16, %eax
        cmpl    $80000, %eax
        jne     .L3
    

    As you can see, with vectorization enabled GCC emits code that performs two loop iterations in parallel. It can be improved, though: this code uses only the 128-bit XMM registers, but it can use the full 256-bit YMM registers by enabling the AVX instruction set (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c gives for the inner loop:

    .L3:
        vmovapd a(%eax), %ymm0
        vaddpd  b(%eax), %ymm0, %ymm0
        vmulpd  c(%eax), %ymm0, %ymm0
        vmovapd %ymm0, r(%eax)
        addl    $32, %eax
        cmpl    $80000, %eax
        jne     .L3
    

    Note the v prefix on each instruction and that the instructions now use the 256-bit YMM registers, so four iterations of the original loop are executed in parallel.
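
    To make the "two iterations in parallel" of the SSE2 listing concrete, here is a rough hand-written equivalent using SSE2 intrinsics. It is only a sketch of what the auto-vectorizer produces, not its actual output: add_mul is a hypothetical function name, the unaligned load/store intrinsics are used so that it works on any array (the compiler used the aligned movapd because it knows the alignment of the globals), and a scalar loop covers the tail and non-SSE2 targets.

```c
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: r[i] = (a[i] + b[i]) * c[i] for i in [0, n). */
void add_mul(const double *a, const double *b, const double *c,
             double *r, size_t n)
{
    size_t i = 0;
#ifdef __SSE2__
    /* Two doubles per iteration, one 128-bit XMM register each. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);              /* load a[i], a[i+1] */
        __m128d vs = _mm_add_pd(va, _mm_loadu_pd(b + i)); /* addpd */
        __m128d vp = _mm_mul_pd(vs, _mm_loadu_pd(c + i)); /* mulpd */
        _mm_storeu_pd(r + i, vp);                      /* store r[i], r[i+1] */
    }
#endif
    /* Scalar tail (or the whole loop when SSE2 is unavailable). */
    for (; i < n; ++i)
        r[i] = (a[i] + b[i]) * c[i];
}
```

    The AVX version would follow the same pattern with four doubles per step, but in practice letting GCC vectorize the plain loop, as shown above, is simpler than writing intrinsics by hand.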
